=head1 NAME

perlrecharclass - Perl Regular Expression Character Classes

=head1 DESCRIPTION

The top level documentation about Perl regular expressions
is found in L<perlre>.

This manual page discusses the syntax and use of character
classes in Perl regular expressions.

A character class is a way of denoting a set of characters
in such a way that one character of the set is matched.
It's important to remember that matching a character class
consumes exactly one character in the source string. (The source
string is the string the regular expression is matched against.)

There are three types of character classes in Perl regular
expressions: the dot, backslash sequences, and the form enclosed in square
brackets.  Keep in mind, though, that often the term "character class" is used
to mean just the bracketed form.  Certainly, most Perl documentation does that.

=head2 The dot

The dot (or period), C<.>, is probably the most used, and certainly
the most well-known, character class. By default, a dot matches any
character, except for the newline. That default can be changed to
add matching the newline by using the I<single line> modifier: either
for the entire regular expression with the C</s> modifier, or
locally with C<(?s)>.  (The experimental C<\N> backslash sequence, described
below, matches any character except newline without regard to the
I<single line> modifier.)

Here are some examples:

 "a"  =~  /./       # Match
 "."  =~  /./       # Match
 ""   =~  /./       # No match (dot has to match a character)
 "\n" =~  /./       # No match (dot does not match a newline)
 "\n" =~  /./s      # Match (global 'single line' modifier)
 "\n" =~  /(?s:.)/  # Match (local 'single line' modifier)
 "ab" =~  /^.$/     # No match (dot matches one character)

=head2 Backslash sequences
X<\w> X<\W> X<\s> X<\S> X<\d> X<\D> X<\p> X<\P>
X<\N> X<\v> X<\V> X<\h> X<\H>

A backslash sequence is a sequence of characters, the first one of which is a
backslash.
Perl ascribes special meaning to many such sequences, and some of
these are character classes.  That is, they match a single character each,
provided that the character belongs to the specific set of characters defined
by the sequence.

Here's a list of the backslash sequences that are character classes.  They
are discussed in more detail below.  (For the backslash sequences that aren't
character classes, see L<perlrebackslash>.)

 \d             Match a decimal digit character.
 \D             Match a non-decimal-digit character.
 \w             Match a "word" character.
 \W             Match a non-"word" character.
 \s             Match a whitespace character.
 \S             Match a non-whitespace character.
 \h             Match a horizontal whitespace character.
 \H             Match a character that isn't horizontal whitespace.
 \v             Match a vertical whitespace character.
 \V             Match a character that isn't vertical whitespace.
 \N             Match a character that isn't a newline.  Experimental.
 \pP, \p{Prop}  Match a character that has the given Unicode property.
 \PP, \P{Prop}  Match a character that doesn't have the Unicode property.

=head3 \N

C<\N> is new in 5.12, and is experimental.  It, like the dot, matches any
character that is not a newline. The difference is that C<\N> is not influenced
by the I<single line> regular expression modifier (see L</The dot> above).  Note
that the form C<\N{...}> may mean something completely different.  When the
C<{...}> is a L<quantifier|perlre/Quantifiers>, it means to match a non-newline
character that many times.  For example, C<\N{3}> means to match 3
non-newlines; C<\N{5,}> means to match 5 or more non-newlines.  But if C<{...}>
is not a legal quantifier, it is presumed to be a named character.  See
L<charnames> for those.  For example, none of C<\N{COLON}>, C<\N{4F}>, and
C<\N{F4}> contain legal quantifiers, so Perl will try to find characters whose
names are respectively C<COLON>, C<4F>, and C<F4>.

=head3 Digits

C<\d> matches a single character considered to be a decimal I<digit>.
If the C</a> regular expression modifier is in effect, it matches [0-9].
Otherwise, it matches anything that is matched by C<\p{Digit}>, which includes
[0-9].
(An unlikely possible exception is that under locale matching rules, the current locale might not have [0-9] matched by C<\d>, and/or might match other characters whose code point is less than 256. Such a locale definition would be in violation of the C language standard, but Perl doesn't currently assume anything in regard to this.) What this means is that unless the C modifier is in effect C<\d> not only matches the digits '0' - '9', but also Arabic, Devanagari, and digits from other languages. This may cause some confusion, and some security issues. Some digits that C<\d> matches look like some of the [0-9] ones, but have different values. For example, BENGALI DIGIT FOUR (U+09EA) looks very much like an ASCII DIGIT EIGHT (U+0038). An application that is expecting only the ASCII digits might be misled, or if the match is C<\d+>, the matched string might contain a mixture of digits from different writing systems that look like they signify a number different than they actually do. L can be used to safely calculate the value, returning C if the input string contains such a mixture. What C<\p{Digit}> means (and hence C<\d> except under the C modifier) is C<\p{General_Category=Decimal_Number}>, or synonymously, C<\p{General_Category=Digit}>. Starting with Unicode version 4.1, this is the same set of characters matched by C<\p{Numeric_Type=Decimal}>. But Unicode also has a different property with a similar name, C<\p{Numeric_Type=Digit}>, which matches a completely different set of characters. These characters are things such as C or subscripts, or are from writing systems that lack all ten digits. The design intent is for C<\d> to exactly match the set of characters that can safely be used with "normal" big-endian positional decimal syntax, where, for example 123 means one 'hundred', plus two 'tens', plus three 'ones'. 
This positional notation does not necessarily apply to characters that match the other type of "digit", C<\p{Numeric_Type=Digit}>, and so C<\d> doesn't match them. The Tamil digits (U+0BE6 - U+0BEF) can also legally be used in old-style Tamil numbers in which they would appear no more than one in a row, separated by characters that mean "times 10", "times 100", etc. (See L.) Any character not matched by C<\d> is matched by C<\D>. =head3 Word characters A C<\w> matches a single alphanumeric character (an alphabetic character, or a decimal digit) or a connecting punctuation character, such as an underscore ("_"). It does not match a whole word. To match a whole word, use C<\w+>. This isn't the same thing as matching an English word, but in the ASCII range it is the same as a string of Perl-identifier characters. =over =item If the C modifier is in effect ... C<\w> matches the 63 characters [a-zA-Z0-9_]. =item otherwise ... =over =item For code points above 255 ... C<\w> matches the same as C<\p{Word}> matches in this range. That is, it matches Thai letters, Greek letters, etc. This includes connector punctuation (like the underscore) which connect two words together, or diacritics, such as a C and the modifier letters, which are generally used to add auxiliary markings to letters. =item For code points below 256 ... =over =item if locale rules are in effect ... C<\w> matches the platform's native underscore character plus whatever the locale considers to be alphanumeric. =item if Unicode rules are in effect or if on an EBCDIC platform ... C<\w> matches exactly what C<\p{Word}> matches. =item otherwise ... C<\w> matches [a-zA-Z0-9_]. =back =back =back Which rules apply are determined as described in L. There are a number of security issues with the full Unicode list of word characters. See L. 
Also, for a somewhat finer-grained set of characters that are in programming language identifiers beyond the ASCII range, you may wish to instead use the more customized L, C<\p{ID_Start}>, C<\p{ID_Continue}>, C<\p{XID_Start}>, and C<\p{XID_Continue}>. See L. Any character not matched by C<\w> is matched by C<\W>. =head3 Whitespace C<\s> matches any single character considered whitespace. =over =item If the C modifier is in effect ... C<\s> matches the 5 characters [\t\n\f\r ]; that is, the horizontal tab, the newline, the form feed, the carriage return, and the space. (Note that it doesn't match the vertical tab, C<\cK> on ASCII platforms.) =item otherwise ... =over =item For code points above 255 ... C<\s> matches exactly the code points above 255 shown with an "s" column in the table below. =item For code points below 256 ... =over =item if locale rules are in effect ... C<\s> matches whatever the locale considers to be whitespace. Note that this is likely to include the vertical space, unlike non-locale C<\s> matching. =item if Unicode rules are in effect or if on an EBCDIC platform ... C<\s> matches exactly the characters shown with an "s" column in the table below. =item otherwise ... C<\s> matches [\t\n\f\r ]. Note that this list doesn't include the non-breaking space. =back =back =back Which rules apply are determined as described in L. Any character not matched by C<\s> is matched by C<\S>. C<\h> matches any character considered horizontal whitespace; this includes the platform's space and tab characters and several others listed in the table below. C<\H> matches any character not considered horizontal whitespace. They use the platform's native character set, and do not consider any locale that may otherwise be in use. C<\v> matches any character considered vertical whitespace; this includes the platform's carriage return and line feed characters (newline) plus several other characters, all listed in the table below. 
C<\V> matches any character not considered vertical whitespace. They use the platform's native character set, and do not consider any locale that may otherwise be in use. C<\R> matches anything that can be considered a newline under Unicode rules. It's not a character class, as it can match a multi-character sequence. Therefore, it cannot be used inside a bracketed character class; use C<\v> instead (vertical whitespace). It uses the platform's native character set, and does not consider any locale that may otherwise be in use. Details are discussed in L. Note that unlike C<\s> (and C<\d> and C<\w>), C<\h> and C<\v> always match the same characters, without regard to other factors, such as the active locale or whether the source string is in UTF-8 format. One might think that C<\s> is equivalent to C<[\h\v]>. This is not true. The difference is that the vertical tab (C<"\x0b">) is not matched by C<\s>; it is however considered vertical whitespace. The following table is a complete listing of characters matched by C<\s>, C<\h> and C<\v> as of Unicode 6.0. The first column gives the Unicode code point of the character (in hex format), the second column gives the (Unicode) name. The third column indicates by which class(es) the character is matched (assuming no locale or EBCDIC code page is in effect that changes the C<\s> matching). 

 0x0009  CHARACTER TABULATION        h s
 0x000a  LINE FEED (LF)               vs
 0x000b  LINE TABULATION              v
 0x000c  FORM FEED (FF)               vs
 0x000d  CARRIAGE RETURN (CR)         vs
 0x0020  SPACE                       h s
 0x0085  NEXT LINE (NEL)              vs  [1]
 0x00a0  NO-BREAK SPACE              h s  [1]
 0x1680  OGHAM SPACE MARK            h s
 0x180e  MONGOLIAN VOWEL SEPARATOR   h s
 0x2000  EN QUAD                     h s
 0x2001  EM QUAD                     h s
 0x2002  EN SPACE                    h s
 0x2003  EM SPACE                    h s
 0x2004  THREE-PER-EM SPACE          h s
 0x2005  FOUR-PER-EM SPACE           h s
 0x2006  SIX-PER-EM SPACE            h s
 0x2007  FIGURE SPACE                h s
 0x2008  PUNCTUATION SPACE           h s
 0x2009  THIN SPACE                  h s
 0x200a  HAIR SPACE                  h s
 0x2028  LINE SEPARATOR               vs
 0x2029  PARAGRAPH SEPARATOR          vs
 0x202f  NARROW NO-BREAK SPACE       h s
 0x205f  MEDIUM MATHEMATICAL SPACE   h s
 0x3000  IDEOGRAPHIC SPACE           h s

=over 4

=item [1]

NEXT LINE and NO-BREAK SPACE may or may not match C<\s> depending
on the rules in effect.  See L</Whitespace>.

=back

=head3 Unicode Properties

C<\pP> and C<\p{Prop}> are character classes to match characters that fit given
Unicode properties.  One letter property names can be used in the C<\pP> form,
with the property name following the C<\p>; otherwise, braces are required.
When using braces, there is a single form, which is just the property name
enclosed in the braces, and a compound form which looks like C<\p{name=value}>,
which means to match if the property "name" for the character has that
particular "value".
For instance, a match for a number can be written as C</\pN/> or as
C</\p{Number}/>, or as C</\p{Number=True}/>.
Lowercase letters are matched by the property I<Lowercase_Letter> which
has the short form I<Ll>. They need the braces, so are written as C</\p{Ll}/> or
C</\p{Lowercase_Letter}/>, or C</\p{General_Category=Lowercase_Letter}/>
(the underscores are optional).
C</\pLl/> is valid, but means something different.
It matches a two character string: a letter (Unicode property C<\pL>),
followed by a lowercase C<l>.

If neither the C</a> modifier nor locale rules are in effect, the use of
a Unicode property will force the regular expression into using Unicode
rules.

Note that almost all properties are immune to case-insensitive matching.
That is, adding a C</i> regular expression modifier does not change what
they match.
There are two sets that are affected. The first set is C, C, and C, all of which match C under C matching. The second set is C, C, and C, all of which match C under C matching. (The difference between these sets is that some things, such as Roman numerals, come in both upper and lower case, so they are C, but aren't considered to be letters, so they aren't Cs. They're actually Cs.) This set also includes its subsets C and C, both of which under C match C. For more details on Unicode properties, see L; for a complete list of possible properties, see L, which notes all forms that have C differences. It is also possible to define your own properties. This is discussed in L. Unicode properties are defined (surprise!) only on Unicode code points. A warning is raised and all matches fail on non-Unicode code points (those above the legal Unicode maximum of 0x10FFFF). This can be somewhat surprising, chr(0x110000) =~ \p{ASCII_Hex_Digit=True} # Fails. chr(0x110000) =~ \p{ASCII_Hex_Digit=False} # Also fails! Even though these two matches might be thought of as complements, they are so only on Unicode code points. =head4 Examples "a" =~ /\w/ # Match, "a" is a 'word' character. "7" =~ /\w/ # Match, "7" is a 'word' character as well. "a" =~ /\d/ # No match, "a" isn't a digit. "7" =~ /\d/ # Match, "7" is a digit. " " =~ /\s/ # Match, a space is whitespace. "a" =~ /\D/ # Match, "a" is a non-digit. "7" =~ /\D/ # No match, "7" is not a non-digit. " " =~ /\S/ # No match, a space is not non-whitespace. " " =~ /\h/ # Match, space is horizontal whitespace. " " =~ /\v/ # No match, space is not vertical whitespace. "\r" =~ /\v/ # Match, a return is vertical whitespace. "a" =~ /\pL/ # Match, "a" is a letter. "a" =~ /\p{Lu}/ # No match, /\p{Lu}/ matches upper case letters. "\x{0e0b}" =~ /\p{Thai}/ # Match, \x{0e0b} is the character # 'THAI CHARACTER SO SO', and that's in # Thai Unicode class. "a" =~ /\P{Lao}/ # Match, as "a" is not a Laotian character. 
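
The security concern raised for C<\d> earlier in this section can be sketched as follows. This is only an illustration, not part of Perl itself: the helper names C<is_ascii_number> and C<is_unicode_number> are hypothetical, and the sketch assumes a Perl recent enough to accept the C</a> and C</u> modifiers.

```perl
use strict;
use warnings;

# Hypothetical helper: true only if $str consists solely of ASCII digits.
# The /a modifier restricts \d to [0-9].
sub is_ascii_number {
    my ($str) = @_;
    return $str =~ /\A\d+\z/a ? 1 : 0;
}

# Hypothetical helper: true if $str consists solely of characters that
# \d matches under Unicode rules (a decimal digit from any script).
sub is_unicode_number {
    my ($str) = @_;
    return $str =~ /\A\d+\z/u ? 1 : 0;
}

# "4" followed by BENGALI DIGIT FOUR (U+09EA), which resembles an ASCII "8".
my $mixed = "4\x{09EA}";

print is_ascii_number("42"),     "\n";   # ASCII digits only
print is_ascii_number($mixed),   "\n";   # mixed scripts, checked under /a
print is_unicode_number($mixed), "\n";   # mixed scripts, checked under /u
```

An application that expects only ASCII digits would probably want the first form, since the second accepts strings that mix digits from different writing systems.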

It is worth emphasizing that C<\d>, C<\w>, etc, match single characters, not
complete numbers or words. To match a number (that consists of digits),
use C<\d+>; to match a word, use C<\w+>.  But be aware of the security
considerations in doing so, as mentioned above.

=head2 Bracketed Character Classes

The third form of character class you can use in Perl regular expressions
is the bracketed character class.  In its simplest form, it lists the characters
that may be matched, surrounded by square brackets, like this: C<[aeiou]>.
This matches one of C<a>, C<e>, C<i>, C<o> or C<u>.  Like the other
character classes, exactly one character is matched.* To match
a longer string consisting of characters mentioned in the character
class, follow the character class with a L<quantifier|perlre/Quantifiers>.  For
instance, C<[aeiou]+> matches one or more lowercase English vowels.

Repeating a character in a character class has no
effect; it's considered to be in the set only once.

Examples:

 "e"  =~  /[aeiou]/        # Match, as "e" is listed in the class.
 "p"  =~  /[aeiou]/        # No match, "p" is not listed in the class.
 "ae" =~  /^[aeiou]$/      # No match, a character class only matches
                           # a single character.
 "ae" =~  /^[aeiou]+$/     # Match, due to the quantifier.

 -------

* There is an exception to a bracketed character class matching a
single character only.  When the class is to match caselessly under C</i>
matching rules, and a character inside the class matches a
multiple-character sequence caselessly under Unicode rules, the class
(when not L<inverted|/Negation>) will also match that sequence.  For
example, Unicode says that the letter C<LATIN SMALL LETTER SHARP S>
should match the sequence C<ss> under C</i> rules.
Thus, 'ss' =~ /\A\N{LATIN SMALL LETTER SHARP S}\z/i # Matches 'ss' =~ /\A[aeioust\N{LATIN SMALL LETTER SHARP S}]\z/i # Matches =head3 Special Characters Inside a Bracketed Character Class Most characters that are meta characters in regular expressions (that is, characters that carry a special meaning like C<.>, C<*>, or C<(>) lose their special meaning and can be used inside a character class without the need to escape them. For instance, C<[()]> matches either an opening parenthesis, or a closing parenthesis, and the parens inside the character class don't group or capture. Characters that may carry a special meaning inside a character class are: C<\>, C<^>, C<->, C<[> and C<]>, and are discussed below. They can be escaped with a backslash, although this is sometimes not needed, in which case the backslash may be omitted. The sequence C<\b> is special inside a bracketed character class. While outside the character class, C<\b> is an assertion indicating a point that does not have either two word characters or two non-word characters on either side, inside a bracketed character class, C<\b> matches a backspace character. The sequences C<\a>, C<\c>, C<\e>, C<\f>, C<\n>, C<\N{I}>, C<\N{U+I}>, C<\r>, C<\t>, and C<\x> are also special and have the same meanings as they do outside a bracketed character class. (However, inside a bracketed character class, if C<\N{I}> expands to a sequence of characters, only the first one in the sequence is used, with a warning.) Also, a backslash followed by two or three octal digits is considered an octal number. A C<[> is not special inside a character class, unless it's the start of a POSIX character class (see L below). It normally does not need escaping. A C<]> is normally either the end of a POSIX character class (see L below), or it signals the end of the bracketed character class. If you want to include a C<]> in the set of characters, you must generally escape it. 
However, if the C<]> is the I (or the second if the first character is a caret) character of a bracketed character class, it does not denote the end of the class (as you cannot have an empty class) and is considered part of the set of characters that can be matched without escaping. Examples: "+" =~ /[+?*]/ # Match, "+" in a character class is not special. "\cH" =~ /[\b]/ # Match, \b inside in a character class # is equivalent to a backspace. "]" =~ /[][]/ # Match, as the character class contains. # both [ and ]. "[]" =~ /[[]]/ # Match, the pattern contains a character class # containing just ], and the character class is # followed by a ]. =head3 Character Ranges It is not uncommon to want to match a range of characters. Luckily, instead of listing all characters in the range, one may use the hyphen (C<->). If inside a bracketed character class you have two characters separated by a hyphen, it's treated as if all characters between the two were in the class. For instance, C<[0-9]> matches any ASCII digit, and C<[a-m]> matches any lowercase letter from the first half of the ASCII alphabet. Note that the two characters on either side of the hyphen are not necessarily both letters or both digits. Any character is possible, although not advisable. C<['-?]> contains a range of characters, but most people will not know which characters that means. Furthermore, such ranges may lead to portability problems if the code has to run on a platform that uses a different character set, such as EBCDIC. If a hyphen in a character class cannot syntactically be part of a range, for instance because it is the first or the last character of the character class, or if it immediately follows a range, the hyphen isn't special, and so is considered a character to be matched literally. If you want a hyphen in your set of characters to be matched and its position in the class is such that it could be considered part of a range, you must escape that hyphen with a backslash. 

Examples:

 [a-z]       #  Matches a character that is a lower case ASCII letter.
 [a-fz]      #  Matches any letter between 'a' and 'f' (inclusive) or
             #  the letter 'z'.
 [-z]        #  Matches either a hyphen ('-') or the letter 'z'.
 [a-f-m]     #  Matches any letter between 'a' and 'f' (inclusive), the
             #  hyphen ('-'), or the letter 'm'.
 ['-?]       #  Matches any of the characters  '()*+,-./0123456789:;<=>?
             #  (But not on an EBCDIC platform).

=head3 Negation

It is also possible to instead list the characters you do not want to
match. You can do so by using a caret (C<^>) as the first character in the
character class. For instance, C<[^a-z]> matches any character that is not a
lowercase ASCII letter, which therefore includes more than a million
Unicode code points.  The class is said to be "negated" or "inverted".

This syntax makes the caret a special character inside a bracketed character
class, but only if it is the first character of the class. So if you want
the caret as one of the characters to match, either escape the caret or
else don't list it first.

In inverted bracketed character classes, Perl ignores the Unicode rules
that normally say that certain characters should match a sequence of
multiple characters under caseless C</i> matching.  Following those
rules could lead to highly confusing situations:

 "ss" =~ /^[^\xDF]+$/ui;   # Matches!

This should match any sequences of characters that aren't C<\xDF> nor
what C<\xDF> matches under C</i>.  C<"s"> isn't C<\xDF>, but Unicode
says that C<"ss"> is what C<\xDF> matches under C</i>.  So which one
"wins"? Do you fail the match because the string has C<ss> or accept it
because it has an C<s> followed by another C<s>?  Perl has chosen the
latter.

Examples:

 "e"  =~  /[^aeiou]/   #  No match, the 'e' is listed.
 "x"  =~  /[^aeiou]/   #  Match, as 'x' isn't a lowercase vowel.
 "^"  =~  /[^^]/       #  No match, matches anything that isn't a caret.
 "^"  =~  /[x^]/       #  Match, caret is not special here.
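
The rules for literal hyphens and carets described above can be checked with a short script. This is a minimal sketch; the helper C<matches> and the variable names are hypothetical and exist only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the single character $char is matched
# by the compiled pattern $re, and 0 otherwise.
sub matches {
    my ($re, $char) = @_;
    return $char =~ $re ? 1 : 0;
}

my $hyphen_class  = qr/\A[a-f-m]\z/;   # 'a'..'f', a literal hyphen, or 'm'
my $negated_caret = qr/\A[^^]\z/;      # any single character except a caret
my $literal_caret = qr/\A[x^]\z/;      # 'x' or a literal caret

for my $char ('c', '-', 'm', 'h') {
    printf "%s in [a-f-m]: %d\n", $char, matches($hyphen_class, $char);
}
printf "^ in [^^]: %d\n", matches($negated_caret, '^');
printf "^ in [x^]: %d\n", matches($literal_caret, '^');
```

The hyphen after C<a-f> cannot start another range, so it is taken literally; the caret is special only in the first position of the class.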

=head3 Backslash Sequences

You can put any backslash sequence character class (with the exception of
C<\N> and C<\R>) inside a bracketed character class, and it will act just
as if you had put all characters matched by the backslash sequence inside the
character class. For instance, C<[a-f\d]> matches any decimal digit, or any
of the lowercase letters between 'a' and 'f' inclusive.

C<\N> within a bracketed character class must be of the forms C<\N{I<name>}>
or C<\N{U+I<hex char>}>, and NOT be the form that matches non-newlines,
for the same reason that a dot C<.> inside a bracketed character class loses
its special meaning: it matches nearly anything, which generally isn't what you
want to happen.

Examples:

 /[\p{Thai}\d]/     # Matches a character that is either a Thai
                    # character, or a digit.
 /[^\p{Arabic}()]/  # Matches a character that is neither an Arabic
                    # character, nor a parenthesis.

Backslash sequence character classes cannot form one of the endpoints
of a range.  Thus, you can't say:

 /[\p{Thai}-\d]/     # Wrong!

=head3 POSIX Character Classes
X<\p> X<\p{}>

POSIX character classes have the form C<[:class:]>, where I<class> is the
name, and the C<[:> and C<:]> are the delimiters. POSIX character classes only
appear I<inside> bracketed character classes, and are a convenient and
descriptive way of listing a group of characters.

Be careful about the syntax,

 # Correct:
 $string =~ /[[:alpha:]]/

 # Incorrect (will warn):
 $string =~ /[:alpha:]/

The latter pattern would be a character class consisting of a colon,
and the letters C<a>, C<l>, C<p> and C<h>.

POSIX character classes can be part of a larger bracketed character class.
For example,

 [01[:alpha:]%]

is valid and matches '0', '1', any alphabetic character, and the percent sign.

Perl recognizes the following POSIX character classes:

 alpha  Any alphabetical character ("[A-Za-z]").
 alnum  Any alphanumeric character ("[A-Za-z0-9]").
 ascii  Any character in the ASCII character set.
 blank  A GNU extension, equal to a space or a horizontal tab ("\t").
 cntrl  Any control character.  See Note [2] below.
 digit  Any decimal digit ("[0-9]"), equivalent to "\d".
 graph  Any printable character, excluding a space.  See Note [3] below.
 lower  Any lowercase character ("[a-z]").
 print  Any printable character, including a space.  See Note [4] below.
 punct  Any graphical character excluding "word" characters.  Note [5].
 space  Any whitespace character.  "\s" plus the vertical tab ("\cK").
 upper  Any uppercase character ("[A-Z]").
 word   A Perl extension ("[A-Za-z0-9_]"), equivalent to "\w".
 xdigit Any hexadecimal digit ("[0-9a-fA-F]").

Most POSIX character classes have two Unicode-style C<\p> property
counterparts.  (They are not official Unicode properties, but Perl extensions
derived from official Unicode properties.)  The table below shows the relation
between POSIX character classes and these counterparts.

One counterpart, in the column labelled "ASCII-range Unicode" in
the table, matches only characters in the ASCII character set.

The other counterpart, in the column labelled "Full-range Unicode", matches any
appropriate characters in the full Unicode character set.  For example,
C<\p{Alpha}> matches not just the ASCII alphabetic characters, but any
character in the entire Unicode character set considered alphabetic.
An entry in the column labelled "backslash sequence" is a (short)
equivalent.

 [[:...:]]  ASCII-range      Full-range         backslash  Note
            Unicode          Unicode            sequence
 ---------------------------------------------------------------
   alpha    \p{PosixAlpha}   \p{XPosixAlpha}
   alnum    \p{PosixAlnum}   \p{XPosixAlnum}
   ascii    \p{ASCII}
   blank    \p{PosixBlank}   \p{XPosixBlank}    \h         [1]
                             or \p{HorizSpace}             [1]
   cntrl    \p{PosixCntrl}   \p{XPosixCntrl}               [2]
   digit    \p{PosixDigit}   \p{XPosixDigit}    \d
   graph    \p{PosixGraph}   \p{XPosixGraph}               [3]
   lower    \p{PosixLower}   \p{XPosixLower}
   print    \p{PosixPrint}   \p{XPosixPrint}               [4]
   punct    \p{PosixPunct}   \p{XPosixPunct}               [5]
            \p{PerlSpace}    \p{XPerlSpace}     \s         [6]
   space    \p{PosixSpace}   \p{XPosixSpace}               [6]
   upper    \p{PosixUpper}   \p{XPosixUpper}
   word     \p{PosixWord}    \p{XPosixWord}     \w
   xdigit   \p{PosixXDigit}  \p{XPosixXDigit}

=over 4

=item [1]

C<\p{Blank}> and C<\p{HorizSpace}> are synonyms.

=item [2]

Control characters don't produce output as such, but instead usually control
the terminal somehow: for example, newline and backspace are control characters.
In the ASCII range, characters whose code points are between 0 and 31 inclusive,
plus 127 (C<DEL>) are control characters.

On EBCDIC platforms, it is likely that the code page will define C<[[:cntrl:]]>
to be the EBCDIC equivalents of the ASCII controls, plus the controls
that in Unicode have code points from 128 through 159.

=item [3]

Any character that is I<graphical>, that is, visible. This class consists
of all alphanumeric characters and all punctuation characters.

=item [4]

All printable characters, which is the set of all graphical characters
plus those whitespace characters which are not also controls.

=item [5]

C<\p{PosixPunct}> and C<[[:punct:]]> in the ASCII range match all
non-controls, non-alphanumeric, non-space characters:
C<[-!"#$%&'()*+,./:;<=E<gt>?@[\\\]^_`{|}~]> (although if a locale is in effect,
it could alter the behavior of C<[[:punct:]]>).

The similarly named property, C<\p{Punct}>, matches a somewhat different
set in the ASCII range, namely
C<[-!"#%&'()*,./:;?@[\\\]_{}]>.
That is, it is missing the nine characters C<[$+E<lt>=E<gt>^`|~]>.
This is because Unicode splits what POSIX considers to be punctuation into two
categories, Punctuation and Symbols.

C<\p{XPosixPunct}> and (under Unicode rules) C<[[:punct:]]>, match what
C<\p{PosixPunct}> matches in the ASCII range, plus what C<\p{Punct}>
matches.  This is different than strictly matching according to
C<\p{Punct}>.  Another way to say it is that
if Unicode rules are in effect, C<[[:punct:]]> matches all characters
that Unicode considers punctuation, plus all ASCII-range characters that
Unicode considers symbols.

=item [6]

C<\p{SpacePerl}> and C<\p{Space}> differ only in that in non-locale
matching, C<\p{Space}> additionally
matches the vertical tab, C<\cK>.   Same for the two ASCII-only range forms.

=back

There are various other synonyms that can be used besides the names
listed in the table.  For example, C<\p{PosixAlpha}> can be written as
C<\p{Alpha}>.  All are listed in
L<perluniprops/Properties accessible through \p{} and \P{}>,
plus all characters matched by each ASCII-range property.

Both the C<\p> counterparts always assume Unicode rules are in effect.
On ASCII platforms, this means they assume that the code points from 128
to 255 are Latin-1, and that means that using them under locale rules is
unwise unless the locale is guaranteed to be Latin-1 or UTF-8.  In contrast, the
POSIX character classes are useful under locale rules.  They are
affected by the actual rules in effect, as follows:

=over

=item If the C</a> modifier is in effect ...

Each of the POSIX classes matches exactly the same as their ASCII-range
counterparts.

=item otherwise ...

=over

=item For code points above 255 ...

The POSIX class matches the same as its Full-range counterpart.

=item For code points below 256 ...

=over

=item if locale rules are in effect ...

The POSIX class matches according to the locale, except that
C<word> uses the platform's native underscore character, no matter what
the locale is.

=item if Unicode rules are in effect or if on an EBCDIC platform ...

The POSIX class matches the same as the Full-range counterpart. =item otherwise ... The POSIX class matches the same as the ASCII range counterpart. =back =back =back Which rules apply are determined as described in L. It is proposed to change this behavior in a future release of Perl so that whether or not Unicode rules are in effect would not change the behavior: Outside of locale or an EBCDIC code page, the POSIX classes would behave like their ASCII-range counterparts. If you wish to comment on this proposal, send email to C. =head4 Negation of POSIX character classes X A Perl extension to the POSIX character class is the ability to negate it. This is done by prefixing the class name with a caret (C<^>). Some examples: POSIX ASCII-range Full-range backslash Unicode Unicode sequence ----------------------------------------------------- [[:^digit:]] \P{PosixDigit} \P{XPosixDigit} \D [[:^space:]] \P{PosixSpace} \P{XPosixSpace} \P{PerlSpace} \P{XPerlSpace} \S [[:^word:]] \P{PerlWord} \P{XPosixWord} \W The backslash sequence can mean either ASCII- or Full-range Unicode, depending on various factors as described in L. =head4 [= =] and [. .] Perl recognizes the POSIX character classes C<[=class=]> and C<[.class.]>, but does not (yet?) support them. Any attempt to use either construct raises an exception. =head4 Examples /[[:digit:]]/ # Matches a character that is a digit. /[01[:lower:]]/ # Matches a character that is either a # lowercase letter, or '0' or '1'. /[[:digit:][:^xdigit:]]/ # Matches a character that can be anything # except the letters 'a' to 'f'. This is # because the main character class is composed # of two POSIX character classes that are ORed # together, one that matches any digit, and # the other that matches anything that isn't a # hex digit. The result matches all # characters except the letters 'a' to 'f' and # 'A' to 'F'. 
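
The POSIX class examples above can be exercised with a small script. This is a sketch only: the helper C<count_matching> is a hypothetical name, and the test strings are limited to ASCII so that the result does not depend on which rules are in effect.

```perl
use strict;
use warnings;

# Hypothetical helper: counts how many characters of $str are matched by
# the single-character pattern $re (which must contain no capture groups).
sub count_matching {
    my ($re, $str) = @_;
    my $count = () = $str =~ /$re/g;
    return $count;
}

# A POSIX class combined with other members of a bracketed class.
print count_matching(qr/[01[:alpha:]%]/, "a0%2 "), "\n";

# A negated POSIX class (a Perl extension).
print count_matching(qr/[[:^digit:]]/, "a1b2"), "\n";

# Two POSIX classes ORed together: any digit, or anything that is not a
# hex digit.  Only the letters a-f and A-F are left out.
print count_matching(qr/[[:digit:][:^xdigit:]]/, "09afAFgz"), "\n";
```

Note that each POSIX class appears inside an enclosing pair of square brackets, which is the form the section above describes as correct.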

This document is written in pod format hence there are punctuation
characters in odd places.  Do not worry, you've apparently got the
ASCII->EBCDIC translation worked out correctly.  You can read more
about pod in pod/perlpod.pod or the short summary in the INSTALL file.

=head1 NAME

perlos390 - building and installing Perl for OS/390 and z/OS

=head1 SYNOPSIS

This document will help you Configure, build, test and install Perl
on OS/390 (aka z/OS) Unix System Services.

=head1 DESCRIPTION

This is a fully ported Perl for OS/390 Version 2 Release 3, 5, 6, 7,
8, and 9.  It may work on other versions or releases, but those are
the ones we've tested it on.

You may need to carry out some system configuration tasks before
running the Configure script for Perl.

=head2 Tools

The z/OS Unix Tools and Toys list may prove helpful and contains links
to ports of much of the software helpful for building Perl.
http://www.ibm.com/servers/eserver/zseries/zos/unix/bpxa1toy.html

=head2 Unpacking Perl distribution on OS/390

If using ftp remember to transfer the distribution in binary format.

Gunzip/gzip for OS/390 is discussed at:

 http://www.ibm.com/servers/eserver/zseries/zos/unix/bpxa1ty1.html

To extract an ASCII tar archive on OS/390, try this:

 pax -o to=IBM-1047,from=ISO8859-1 -r < latest.tar

or

 zcat latest.tar.Z | pax -o to=IBM-1047,from=ISO8859-1 -r

If you get lots of errors of the form

 tar: FSUM7171 ...: cannot set uid/gid: EDC5139I Operation not permitted.

then you didn't read the above and tried to use tar instead of pax. You'll
first have to remove the (now corrupt) perl directory

 rm -rf perl-...

and then use pax.

=head2 Setup and utilities for Perl on OS/390

Be sure that your yacc installation is in place including any necessary
parser template files.
If you have not already done so then be sure to: cp /samples/yyparse.c /etc This may also be a good time to ensure that your /etc/protocol file and either your /etc/resolv.conf or /etc/hosts files are in place. The IBM document that described such USS system setup issues was SC28-1890-07 "OS/390 UNIX System Services Planning", in particular Chapter 6 on customizing the OE shell. GNU make for OS/390, which is recommended for the build of perl (as well as building CPAN modules and extensions), is available from the L. Some people have reported encountering "Out of memory!" errors while trying to build Perl using GNU make binaries. If you encounter such trouble then try to download the source code kit and build GNU make from source to eliminate any such trouble. You might also find GNU make (as well as Perl and Apache) in the red-piece/book "Open Source Software for OS/390 UNIX", SG24-5944-00 from IBM. If instead of the recommended GNU make you would like to use the system supplied make program then be sure to install the default rules file properly via the shell command: cp /samples/startup.mk /etc and be sure to also set the environment variable _C89_CCMODE=1 (exporting _C89_CCMODE=1 is also a good idea for users of GNU make). You might also want to have GNU groff for OS/390 installed before running the "make install" step for Perl. There is a syntax error in the /usr/include/sys/socket.h header file that IBM supplies with USS V2R7, V2R8, and possibly V2R9. The problem with the header file is that near the definition of the SO_REUSEPORT constant there is a spurious extra '/' character outside of a comment like so: #define SO_REUSEPORT 0x0200 /* allow local address & port reuse */ / You could edit that header yourself to remove that last '/', or you might note that Language Environment (LE) APAR PQ39997 describes the problem and PTF's UQ46272 and UQ46271 are the (R8 at least) fixes and apply them. 
If left unattended that syntax error will turn up as an inability for Perl to build its "Socket" extension. For successful testing you may need to turn on the sticky bit for your world readable /tmp directory if you have not already done so (see man chmod). =head2 Configure Perl on OS/390 Once you've unpacked the distribution, run "sh Configure" (see INSTALL for a full discussion of the Configure options). There is a "hints" file for os390 that specifies the correct values for most things. Some things to watch out for include: =over 4 =item * A message of the form: (I see you are using the Korn shell. Some ksh's blow up on Configure, mainly on older exotic systems. If yours does, try the Bourne shell instead.) is nothing to worry about at all. =item * Some of the parser default template files in /samples are needed in /etc. In particular be sure that you at least copy /samples/yyparse.c to /etc before running Perl's Configure. This step ensures successful extraction of EBCDIC versions of parser files such as perly.c, perly.h, and x2p/a2p.c. This has to be done before running Configure the first time. If you failed to do so then the easiest way to re-Configure Perl is to delete your misconfigured build root and re-extract the source from the tar ball. Then you must ensure that /etc/yyparse.c is properly in place before attempting to re-run Configure. =item * This port will support dynamic loading, but it is not selected by default. If you would like to experiment with dynamic loading then be sure to specify -Dusedl in the arguments to the Configure script. See the comments in hints/os390.sh for more information on dynamic loading. If you build with dynamic loading then you will need to add the $archlibexp/CORE directory to your LIBPATH environment variable in order for perl to work. See the config.sh file for the value of $archlibexp. If in trying to use Perl you see an error message similar to: CEE3501S The module libperl.dll was not found. 
From entry point __dllstaticinit at compile unit offset +00000194 at then your LIBPATH does not have the location of libperl.x and either libperl.dll or libperl.so in it. Add that directory to your LIBPATH and proceed. =item * Do not turn on the compiler optimization flag "-O". There is a bug in either the optimizer or perl that causes perl to not work correctly when the optimizer is on. =item * Some of the configuration files in /etc used by the networking APIs are either missing or have the wrong names. In particular, make sure that there's either an /etc/resolv.conf or an /etc/hosts, so that gethostbyname() works, and make sure that the file /etc/proto has been renamed to /etc/protocol (NOT /etc/protocols, as used by other Unix systems). You may have to look for things like HOSTNAME and DOMAINORIGIN in the "//'SYS1.TCPPARMS(TCPDATA)'" PDS member in order to properly set up your /etc networking files. =back =head2 Build, Test, Install Perl on OS/390 Simply put: sh Configure make make test if everything looks ok (see the next section for test/IVP diagnosis) then: make install this last step may or may not require UID=0 privileges depending on how you answered the questions that Configure asked and whether or not you have write access to the directories you specified. =head2 Build Anomalies with Perl on OS/390 "Out of memory!" messages during the build of Perl are most often fixed by re building the GNU make utility for OS/390 from a source code kit. Another memory limiting item to check is your MAXASSIZE parameter in your 'SYS1.PARMLIB(BPXPRMxx)' data set (note too that as of V2R8 address space limits can be set on a per user ID basis in the USS segment of a RACF profile). People have reported successful builds of Perl with MAXASSIZE parameters as small as 503316480 (and it may be possible to build Perl with a MAXASSIZE smaller than that). Within USS your /etc/profile or $HOME/.profile may limit your ulimit settings. 
Check that the following command returns reasonable values: ulimit -a To conserve memory you should have your compiler modules loaded into the Link Pack Area (LPA/ELPA) rather than in a link list or step lib. If the c89 compiler complains of syntax errors during the build of the Socket extension then be sure to fix the syntax error in the system header /usr/include/sys/socket.h. =head2 Testing Anomalies with Perl on OS/390 The "make test" step runs a Perl Verification Procedure, usually before installation. You might encounter STDERR messages even during a successful run of "make test". Here is a guide to some of the more commonly seen anomalies: =over 4 =item * A message of the form: io/openpid...........CEE5210S The signal SIGHUP was received. CEE5210S The signal SIGHUP was received. CEE5210S The signal SIGHUP was received. ok indicates that the t/io/openpid.t test of Perl has passed but done so with extraneous messages on stderr from CEE. =item * A message of the form: lib/ftmp-security....File::Temp::_gettemp: Parent directory (/tmp/) is not safe (sticky bit not set when world writable?) at lib/ftmp-security.t line 100 File::Temp::_gettemp: Parent directory (/tmp/) is not safe (sticky bit not set when world writable?) at lib/ftmp-security.t line 100 ok indicates a problem with the permissions on your /tmp directory within the HFS. To correct that problem issue the command: chmod a+t /tmp from an account with write access to the directory entry for /tmp. =item * Out of Memory! Recent perl test suite is quite memory hungry. In addition to the comments above on memory limitations it is also worth checking for _CEE_RUNOPTS in your environment. Perl now has (in miniperlmain.c) a C #pragma to set CEE run options, but the environment variable wins. 
The C code asks for: #pragma runopts(HEAP(2M,500K,ANYWHERE,KEEP,8K,4K) STACK(,,ANY,) ALL31(ON)) The important parts of that are the second argument (the increment) to HEAP, and allowing the stack to be "Above the (16M) line". If the heap increment is too small then when perl (for example loading unicode/Name.pl) tries to create a "big" (400K+) string it cannot fit in a single segment and you get "Out of Memory!" - even if there is still plenty of memory available. A related issue is use with perl's malloc. Perl's malloc uses C to get memory, and C is limited to the first allocation so in this case something like: HEAP(8M,500K,ANYWHERE,KEEP,8K,4K) is needed to get through the test suite. =back =head2 Installation Anomalies with Perl on OS/390 The installman script will try to run on OS/390. There will be fewer errors if you have a roff utility installed. You can obtain GNU groff from the Redbook SG24-5944-00 ftp site. =head2 Usage Hints for Perl on OS/390 When using perl on OS/390 please keep in mind that the EBCDIC and ASCII character sets are different. See perlebcdic.pod for more on such character set issues. Perl builtin functions that may behave differently under EBCDIC are also mentioned in the perlport.pod document. Open Edition (UNIX System Services) from V2R8 onward does support #!/path/to/perl script invocation. There is a PTF available from IBM for V2R7 that will allow shell/kernel support for #!. USS releases prior to V2R7 did not support the #! means of script invocation. If you are running V2R6 or earlier then see: head `whence perldoc` for an example of how to use the "eval exec" trick to ask the shell to have Perl run your scripts on those older releases of Unix System Services. If you are having trouble with square brackets then consider switching your rlogin or telnet client. Try to avoid older 3270 emulators and ISHELL for working with Perl on USS. 
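
For releases without #! support, the "eval exec" trick mentioned above can be sketched as follows. This is a minimal illustration, not the exact text of perldoc: the interpreter path F</usr/bin/perl> and the C<$greeting> variable are assumptions, so substitute the path of your own installation.

```perl
#!/usr/bin/perl
eval 'exec /usr/bin/perl -S $0 ${1+"$@"}'
    if 0;    # never true, so Perl itself skips the eval

# If a shell reads this file instead of Perl, it executes the eval
# line on its own and exec replaces the shell with perl, which then
# re-reads the script.  The path above is an assumed location.

use strict;
use warnings;

my $greeting = "Hello from Perl on USS";
print "$greeting\n";
```

The shell never reaches the C<if 0;> line because C<exec> has already replaced it, while Perl parses both lines as one statement whose condition is false.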
=head2 Floating Point Anomalies with Perl on OS/390

There appears to be a bug in the floating point implementation on S/390
systems such that calling int() on the product of a number and a small
magnitude number is not the same as calling int() on the quotient of
that number and a large magnitude number.  For example, in the following
Perl code:

    my $x = 100000.0;
    my $y = int($x * 1e-5) * 1e5; # '0'
    my $z = int($x / 1e+5) * 1e5; # '100000'
    print "\$y is $y and \$z is $z\n"; # $y is 0 and $z is 100000

Although one would expect the quantities $y and $z to be the same and equal
to 100000, they will differ and instead will be 0 and 100000 respectively.

The problem can be further examined in a roughly equivalent C program:

    #include <stdio.h>
    #include <math.h>
    int main(void)
    {
    double r1,r2;
    double x = 100000.0;
    double y = 0.0;
    double z = 0.0;
    x = 100000.0 * 1e-5;   /* product with a small magnitude number */
    r1 = modf (x,&y);      /* integer part goes to y */
    x = 100000.0 / 1e+5;   /* quotient with a large magnitude number */
    r2 = modf (x,&z);      /* integer part goes to z */
    printf("y is %e and z is %e\n",y*1e5,z*1e5);
    /* y is 0.000000e+00 and z is 1.000000e+05 (with c89) */
    return 0;
    }

=head2 Modules and Extensions for Perl on OS/390

Pure Perl (that is, non-XS) modules may be installed via the usual:

    perl Makefile.PL
    make
    make test
    make install

If you built perl with dynamic loading capability then that would also
be the way to build XS based extensions.  However, if you built perl with
the default static linking you can still build XS based extensions for OS/390
but you will need to follow the instructions in ExtUtils::MakeMaker for
building statically linked perl binaries.  In the simplest configurations
building a static perl + XS extension boils down to:

    perl Makefile.PL
    make
    make perl
    make test
    make install
    make -f Makefile.aperl inst_perl MAP_TARGET=perl

In most cases people have reported better results with GNU make rather
than the system's /bin/make program, whether for plain modules or for
XS based extensions.

If the make process encounters trouble with either compilation or
linking then try setting the _C89_CCMODE environment variable to 1.
Assuming sh is your login shell then run:

    export _C89_CCMODE=1

If tcsh is your login shell then use the setenv command.

=head1 AUTHORS

David Fiander and Peter Prymmer with thanks to Dennis Longnecker
and William Raffloer for valuable reports, LPAR and PTF feedback.
Thanks to Mike MacIsaac and Egon Terwedow for SG24-5944-00.
Thanks to Ignasi Roca for pointing out the floating point problems.
Thanks to John Goodyear for dynamic loading help.

=head1 SEE ALSO

L<INSTALL>, L<perlport>, L<perlebcdic>, L<ExtUtils::MakeMaker>.

    http://www.ibm.com/servers/eserver/zseries/zos/unix/bpxa1toy.html

    http://www.redbooks.ibm.com/redbooks/SG245944.html

    http://www.ibm.com/servers/eserver/zseries/zos/unix/bpxa1ty1.html#opensrc

    http://www.xray.mpe.mpg.de/mailing-lists/perl-mvs/

    http://publibz.boulder.ibm.com:80/cgi-bin/bookmgr_OS390/BOOKS/ceea3030/

    http://publibz.boulder.ibm.com:80/cgi-bin/bookmgr_OS390/BOOKS/CBCUG030/

=head2 Mailing list for Perl on OS/390

If you are interested in the VM/ESA, z/OS (formerly known as OS/390)
and POSIX-BC (BS2000) ports of Perl then see the perl-mvs mailing list.
To subscribe, send an empty message to perl-mvs-subscribe@perl.org.

See also:

    http://lists.perl.org/list/perl-mvs.html

There are web archives of the mailing list at:

    http://www.xray.mpe.mpg.de/mailing-lists/perl-mvs/
    http://archive.develooper.com/perl-mvs@perl.org/

=head1 HISTORY

This document was originally written by David Fiander for the 5.005
release of Perl.

This document was podified for the 5.005_03 release of Perl 11 March 1999.

Updated 28 November 2001 for broken URLs.

Updated 12 November 2000 for the 5.7.1 release of Perl.

Updated 15 January 2001 for the 5.7.1 release of Perl.

Updated 24 January 2001 to mention dynamic loading.

Updated 12 March 2001 to mention //'SYS1.TCPPARMS(TCPDATA)'.

=cut

=for comment
This document is in Pod format.  To read this, use a Pod formatter,
like "perldoc perlpod".

=head1 NAME X X perlpod - the Plain Old Documentation format =head1 DESCRIPTION Pod is a simple-to-use markup language used for writing documentation for Perl, Perl programs, and Perl modules. Translators are available for converting Pod to various formats like plain text, HTML, man pages, and more. Pod markup consists of three basic kinds of paragraphs: L, L, and L. =head2 Ordinary Paragraph X Most paragraphs in your documentation will be ordinary blocks of text, like this one. You can simply type in your text without any markup whatsoever, and with just a blank line before and after. When it gets formatted, it will undergo minimal formatting, like being rewrapped, probably put into a proportionally spaced font, and maybe even justified. You can use formatting codes in ordinary paragraphs, for B, I, C, L, and more. Such codes are explained in the "L" section, below. =head2 Verbatim Paragraph X X Verbatim paragraphs are usually used for presenting a codeblock or other text which does not require any special parsing or formatting, and which shouldn't be wrapped. A verbatim paragraph is distinguished by having its first character be a space or a tab. (And commonly, all its lines begin with spaces and/or tabs.) It should be reproduced exactly, with tabs assumed to be on 8-column boundaries. There are no special formatting codes, so you can't italicize or anything like that. A \ means \, and nothing else. =head2 Command Paragraph X A command paragraph is used for special treatment of whole chunks of text, usually as headings or parts of lists. All command paragraphs (which are typically only one line long) start with "=", followed by an identifier, followed by arbitrary text that the command can use however it pleases. Currently recognized commands are =pod =head1 Heading Text =head2 Heading Text =head3 Heading Text =head4 Heading Text =over indentlevel =item stuff =back =begin format =end format =for format text... 
=encoding type =cut To explain them each in detail: =over =item C<=head1 I> X<=head1> X<=head2> X<=head3> X<=head4> X X X X =item C<=head2 I> =item C<=head3 I> =item C<=head4 I> Head1 through head4 produce headings, head1 being the highest level. The text in the rest of this paragraph is the content of the heading. For example: =head2 Object Attributes The text "Object Attributes" comprises the heading there. (Note that head3 and head4 are recent additions, not supported in older Pod translators.) The text in these heading commands can use formatting codes, as seen here: =head2 Possible Values for C<$/> Such commands are explained in the "L" section, below. =item C<=over I> X<=over> X<=item> X<=back> X X X =item C<=item I> =item C<=back> Item, over, and back require a little more explanation: "=over" starts a region specifically for the generation of a list using "=item" commands, or for indenting (groups of) normal paragraphs. At the end of your list, use "=back" to end it. The I option to "=over" indicates how far over to indent, generally in ems (where one em is the width of an "M" in the document's base font) or roughly comparable units; if there is no I option, it defaults to four. (And some formatters may just ignore whatever I you provide.) In the I in C<=item I>, you may use formatting codes, as seen here: =item Using C<$|> to Control Buffering Such commands are explained in the "L" section, below. Note also that there are some basic rules to using "=over" ... "=back" regions: =over =item * Don't use "=item"s outside of an "=over" ... "=back" region. =item * The first thing after the "=over" command should be an "=item", unless there aren't going to be any items at all in this "=over" ... "=back" region. =item * Don't put "=headI" commands inside an "=over" ... "=back" region. 
=item * And perhaps most importantly, keep the items consistent: either use "=item *" for all of them, to produce bullets; or use "=item 1.", "=item 2.", etc., to produce numbered lists; or use "=item foo", "=item bar", etc.--namely, things that look nothing like bullets or numbers. If you start with bullets or numbers, stick with them, as formatters use the first "=item" type to decide how to format the list. =back =item C<=cut> X<=cut> X To end a Pod block, use a blank line, then a line beginning with "=cut", and a blank line after it. This lets Perl (and the Pod formatter) know that this is where Perl code is resuming. (The blank line before the "=cut" is not technically necessary, but many older Pod processors require it.) =item C<=pod> X<=pod> X The "=pod" command by itself doesn't do much of anything, but it signals to Perl (and Pod formatters) that a Pod block starts here. A Pod block starts with I command paragraph, so a "=pod" command is usually used just when you want to start a Pod block with an ordinary paragraph or a verbatim paragraph. For example: =item stuff() This function does stuff. =cut sub stuff { ... } =pod Remember to check its return value, as in: stuff() || die "Couldn't do stuff!"; =cut =item C<=begin I> X<=begin> X<=end> X<=for> X X X =item C<=end I> =item C<=for I I> For, begin, and end will let you have regions of text/code/data that are not generally interpreted as normal Pod text, but are passed directly to particular formatters, or are otherwise special. A formatter that can use that format will use the region, otherwise it will be completely ignored. A command "=begin I", some paragraphs, and a command "=end I", mean that the text/data in between is meant for formatters that understand the special format called I. For example, =begin html


  <hr> <img src="thang.png">
  <p> This is a raw HTML paragraph </p>

  =end html

The command "=for I<formatname> I<text...>"
specifies that the remainder of just this paragraph (starting
right after I<formatname>) is in that special format.

  =for html

  <hr> <img src="thang.png">
  <p> This is a raw HTML paragraph </p>

This means the same thing as the above "=begin html" ... "=end html"
region.

That is, with "=for", you can have only one paragraph's worth
of text (i.e., the text in "=for targetname text..."), but with
"=begin targetname" ... "=end targetname", you can have any amount
of stuff in between.  (Note that there still must be a blank line
after the "=begin" command and a blank line before the "=end"
command.)

Here are some examples of how to use these:

  =begin html

  <br>Figure 1.<br><IMG SRC="figure1.png"><br>

  =end html

  =begin text

    ---------------
    |  foo        |
    |        bar  |
    ---------------

  ^^^^ Figure 1. ^^^^

  =end text

Some format names that formatters currently are known to accept
include "roff", "man", "latex", "tex", "text", and "html".  (Some
formatters will treat some of these as synonyms.)

A format name of "comment" is common for just making notes (presumably
to yourself) that won't appear in any formatted version of the Pod
document:

  =for comment
  Make sure that all the available options are documented!

Some I<formatnames> will require a leading colon (as in
C<"=for :formatname">, or
C<"=begin :formatname" ... "=end :formatname">),
to signal that the text is not raw data, but instead I<is> Pod text
(i.e., possibly containing formatting codes) that's just not for
normal formatting (e.g., may not be a normal-use paragraph, but might
be for formatting as a footnote).

=item C<=encoding I<encodingname>>
X<=encoding>

This command is used for declaring the encoding of a document.  Most
users won't need this; but if your encoding isn't US-ASCII or Latin-1,
then put a C<=encoding I<encodingname>> command early in the document so
that pod formatters will know how to decode the document.  For
I<encodingname>, use a name recognized by the L<Encode::Supported>
module.  Examples:

  =encoding utf8

  =encoding koi8-r

  =encoding ShiftJIS

  =encoding big5

=back

C<=encoding> affects the whole document, and must occur only once.

And don't forget, when using any other command, that the command lasts up
until the end of its I<paragraph>, not its line.  So in the
examples below, you can see that every command needs the blank
line after it, to end its paragraph.

Some examples of lists include:

  =over

  =item *

  First item

  =item *

  Second item

  =back

  =over

  =item Foo()

  Description of Foo function

  =item Bar()

  Description of Bar function

  =back

=head2 Formatting Codes

In ordinary paragraphs and in some command paragraphs, various
formatting codes (a.k.a. "interior sequences") can be used:

=for comment
 "interior sequences" is such an opaque term.
 Prefer "formatting codes" instead.

=over =item CtextE> -- italic text X X<< IZ<><> >> X X Used for emphasis ("Ccareful!E>") and parameters ("CLABELE>") =item CtextE> -- bold text X X<< BZ<><> >> X X Used for switches ("C-nE switch>"), programs ("CchfnE for that>"), emphasis ("Ccareful!E>"), and so on ("CautovivificationE>"). =item CcodeE> -- code text X X<< CZ<><> >> X X Renders code in a typewriter font, or gives some other indication that this represents program text ("Cgmtime($^T)E>") or some other form of computerese ("Cdrwxr-xr-xE>"). =item CnameE> -- a hyperlink X X<< LZ<><> >> X X There are various syntaxes, listed below. In the syntaxes given, C, C, and C
cannot contain the characters '/' and '|'; and any '<' or '>' should be matched. =over =item * CnameE> Link to a Perl manual page (e.g., CNet::PingE>). Note that C should not contain spaces. This syntax is also occasionally used for references to Unix man pages, as in Ccrontab(5)E>. =item * Cname/"sec"E> or Cname/secE> Link to a section in other manual page. E.g., Cperlsyn/"For Loops"E> =item * C/"sec"E> or C/secE> Link to a section in this manual page. E.g., C/"Object Methods"E> =back A section is started by the named heading or item. For example, Cperlvar/$.E> or Cperlvar/"$."E> both link to the section started by "C<=item $.>" in perlvar. And Cperlsyn/For LoopsE> or Cperlsyn/"For Loops"E> both link to the section started by "C<=head2 For Loops>" in perlsyn. To control what text is used for display, you use "Ctext|...E>", as in: =over =item * Ctext|nameE> Link this text to that manual page. E.g., CPerl Error Messages|perldiagE> =item * Ctext|name/"sec"E> or Ctext|name/secE> Link this text to that section in that manual page. E.g., Cpostfix "if"|perlsyn/"Statement Modifiers"E> =item * Ctext|/"sec"E> or Ctext|/secE> or Ctext|"sec"E> Link this text to that section in this manual page. E.g., Cthe various attributes|/"Member Data"E> =back Or you can link to a web page: =over =item * Cscheme:...E> Ctext|scheme:...E> Links to an absolute URL. For example, Chttp://www.perl.org/E> or CThe Perl Home Page|http://www.perl.org/E>. =back =item CescapeE> -- a character escape X X<< EZ<><> >> X X Very similar to HTML/XML C<&I;> "entity references": =over =item * CltE> -- a literal E (less than) =item * CgtE> -- a literal E (greater than) =item * CverbarE> -- a literal | (Itical I) =item * CsolE> -- a literal / (Iidus) The above four are optional except in other formatting codes, notably C...E>, and when preceded by a capital letter. 
=item * ChtmlnameE> Some non-numeric HTML entity name, such as CeacuteE>, meaning the same thing as C<é> in HTML -- i.e., a lowercase e with an acute (/-shaped) accent. =item * CnumberE> The ASCII/Latin-1/Unicode character with that number. A leading "0x" means that I is hex, as in C0x201EE>. A leading "0" means that I is octal, as in C075E>. Otherwise I is interpreted as being in decimal, as in C181E>. Note that older Pod formatters might not recognize octal or hex numeric escapes, and that many formatters cannot reliably render characters above 255. (Some formatters may even have to use compromised renderings of Latin-1 characters, like rendering CeacuteE> as just a plain "e".) =back =item CfilenameE> -- used for filenames X X<< FZ<><> >> X X Typically displayed in italics. Example: "C.cshrcE>" =item CtextE> -- text contains non-breaking spaces X X<< SZ<><> >> X X This means that the words in I should not be broken across lines. Example: S$x ? $y : $zE>>. =item Ctopic nameE> -- an index entry X X<< XZ<><> >> X X This is ignored by most formatters, but some may use it for building indexes. It always renders as empty-string. Example: Cabsolutizing relative URLsE> =item CE> -- a null (zero-effect) formatting code X X<< ZZ<><> >> X X This is rarely used. It's one way to get around using an EE...E code sometimes. For example, instead of "CltE3>" (for "NE3") you could write "CEE3>" (the "ZEE" breaks up the "N" and the "E" so they can't be considered the part of a (fictitious) "NE...E" code. =for comment This was formerly explained as a "zero-width character". But it in most parser models, it parses to nothing at all, as opposed to parsing as if it were a E or E, which are REAL zero-width characters. So "width" and "character" are exactly the wrong words. =back Most of the time, you will need only a single set of angle brackets to delimit the beginning and end of formatting codes. 
However, sometimes you will want to put a real right angle bracket (a greater-than sign, '>') inside of a formatting code. This is particularly common when using a formatting code to provide a different font-type for a snippet of code. As with all things in Perl, there is more than one way to do it. One way is to simply escape the closing bracket using an C code: C<$a E=E $b> This will produce: "C<$a E=E $b>" A more readable, and perhaps more "plain" way is to use an alternate set of delimiters that doesn't require a single ">" to be escaped. Doubled angle brackets ("<<" and ">>") may be used I For example, the following will do the trick: X C<< $a <=> $b >> In fact, you can use as many repeated angle-brackets as you like so long as you have the same number of them in the opening and closing delimiters, and make sure that whitespace immediately follows the last '<' of the opening delimiter, and immediately precedes the first '>' of the closing delimiter. (The whitespace is ignored.) So the following will also work: X C<<< $a <=> $b >>> C<<<< $a <=> $b >>>> And they all mean exactly the same as this: C<$a E=E $b> The multiple-bracket form does not affect the interpretation of the contents of the formatting code, only how it must end. That means that the examples above are also exactly the same as this: C<< $a E=E $b >> As a further example, this means that if you wanted to put these bits of code in C (code) style: open(X, ">>thing.dat") || die $! $foo->bar(); you could do it like so: C<<< open(X, ">>thing.dat") || die $! >>> C<< $foo->bar(); >> which is presumably easier to read than the old way: CEthing.dat") || die $!> C<$foo-Ebar();> This is currently supported by pod2text (Pod::Text), pod2man (Pod::Man), and any other pod2xxx or Pod::Xxxx translators that use Pod::Parser 1.093 or later, or Pod::Tree 1.02 or later. =head2 The Intent X The intent is simplicity of use, not power of expression. 
Paragraphs look like paragraphs (block format), so that they stand out visually, and so that I could run them through C easily to reformat them (that's F7 in my version of B, or Esc Q in my version of B). I wanted the translator to always leave the C<'> and C<`> and C<"> quotes alone, in verbatim mode, so I could slurp in a working program, shift it over four spaces, and have it print out, er, verbatim. And presumably in a monospace font. The Pod format is not necessarily sufficient for writing a book. Pod is just meant to be an idiot-proof common source for nroff, HTML, TeX, and other markup languages, as used for online documentation. Translators exist for B, B, B (that's for nroff(1) and troff(1)), B, and B. Various others are available in CPAN. =head2 Embedding Pods in Perl Modules X You can embed Pod documentation in your Perl modules and scripts. Start your documentation with an empty line, a "=head1" command at the beginning, and end it with a "=cut" command and an empty line. Perl will ignore the Pod text. See any of the supplied library modules for examples. If you're going to put your Pod at the end of the file, and you're using an __END__ or __DATA__ cut mark, make sure to put an empty line there before the first Pod command. __END__ =head1 NAME Time::Local - efficiently compute time from local and GMT time Without that empty line before the "=head1", many translators wouldn't have recognized the "=head1" as starting a Pod block. =head2 Hints for Writing Pod =over =item * X X The B command is provided for checking Pod syntax for errors and warnings. For example, it checks for completely blank lines in Pod blocks and for unknown commands and formatting codes. You should still also pass your document through one or more translators and proofread the result, or print out the result and proofread that. Some of the problems found may be bugs in the translators, which you may or may not wish to work around. 
=item * If you're more familiar with writing in HTML than with writing in Pod, you can try your hand at writing documentation in simple HTML, and converting it to Pod with the experimental L module, (available in CPAN), and looking at the resulting code. The experimental L module in CPAN might also be useful. =item * Many older Pod translators require the lines before every Pod command and after every Pod command (including "=cut"!) to be a blank line. Having something like this: # - - - - - - - - - - - - =item $firecracker->boom() This noisily detonates the firecracker object. =cut sub boom { ... ...will make such Pod translators completely fail to see the Pod block at all. Instead, have it like this: # - - - - - - - - - - - - =item $firecracker->boom() This noisily detonates the firecracker object. =cut sub boom { ... =item * Some older Pod translators require paragraphs (including command paragraphs like "=head2 Functions") to be separated by I empty lines. If you have an apparently empty line with some spaces on it, this might not count as a separator for those translators, and that could cause odd formatting. =item * Older translators might add wording around an LEE link, so that CFoo::BarE> may become "the Foo::Bar manpage", for example. So you shouldn't write things like CfooE documentation>, if you want the translated document to read sensibly. Instead, write CFoo::Bar|Foo::BarE documentation> or Cthe Foo::Bar documentation|Foo::BarE>, to control how the link comes out. =item * Going past the 70th column in a verbatim block might be ungracefully wrapped by some formatters. =back =head1 SEE ALSO L, L, L, L, L, L, L. =head1 AUTHOR Larry Wall, Sean M. Burke =cut perldgux.pod000064400000005404147633747260007130 0ustar00If you read this file _as_is_, just ignore the funny characters you see. It is written in the POD format (see perlpod manpage) which is specially designed to be readable as is. =head1 NAME perldgux - Perl under DG/UX. 
=head1 SYNOPSIS One can read this document in the following formats: man perldgux view perl perldgux explorer perldgux.html info perldgux to list some (not all may be available simultaneously), or it may be read I: as F. =cut Contents perldgux - Perl under DG/UX. NAME SYNOPSIS DESCRIPTION BUILD - Non-threaded Case - Threaded Case - Testing - Installing the built perl AUTHOR SEE ALSO =head1 DESCRIPTION Perl 5.7/8.x for DG/UX ix86 R4.20MU0x =head1 BUILDING PERL ON DG/UX =head2 Non-threaded Perl on DG/UX Just run ./Configure script from the top directory. Then give "make" to compile. =head2 Threaded Perl on DG/UX If you are using as compiler GCC-2.95.x rev(DG/UX) an easy solution for configuring perl in your DG/UX machine is to run the command: ./Configure -Dusethreads -Duseithreads -Dusedevel -des This will automatically accept all the defaults and in particular /usr/local/ as installation directory. Note that GCC-2.95.x rev(DG/UX) knows the switch -pthread which allows it to link correctly DG/UX's -lthread library. If you want to change the installation directory or have a standard DG/UX with C compiler GCC-2.7.2.x then you have no choice than to do an interactive build by issuing the command: ./Configure -Dusethreads -Duseithreads In particular with GCC-2.7.2.x accept all the defaults and *watch* out for the message: Any additional ld flags (NOT including libraries)? [ -pthread] Instead of -pthread put here -lthread. CGCC-2.7.2.x that comes with the DG/UX OS does NOT know the -pthread switch. So your build will fail if you choose the defaults. After configuration is done correctly give "make" to compile. =head2 Testing Perl on DG/UX Issuing a "make test" will run all the tests. If the test lib/ftmp-security gives you as a result something like lib/ftmp-security....File::Temp::_gettemp: Parent directory (/tmp/) is not safe (sticky bit not set when world writable?) 
    at lib/ftmp-security.t line 100

don't panic and just set the sticky bit in your /tmp directory by doing
the following as root:

    cd /
    chmod +t /tmp    # set the sticky bit on /tmp

Then rerun the tests. This time they should all pass.

=head2 Installing the built perl on DG/UX

Run the command "make install"

=head1 AUTHOR

Takis Psarogiannakopoulos
University of Cambridge
Centre for Mathematical Sciences
Department of Pure Mathematics
Wilberforce road
Cambridge CB3 0WB, UK
email

=head1 SEE ALSO

perl(1).

=cut

=head1 NAME

perlrebackslash - Perl Regular Expression Backslash Sequences and Escapes

=head1 DESCRIPTION

The top level documentation about Perl regular expressions
is found in L<perlre>.

This document describes all backslash and escape sequences. After
explaining the role of the backslash, it lists all the sequences that have
a special meaning in Perl regular expressions (in alphabetical order),
then describes each of them.

Most sequences are described in detail in different documents; the primary
purpose of this document is to have a quick reference guide describing all
backslash and escape sequences.

=head2 The backslash

In a regular expression, the backslash can perform one of two tasks:
it either takes away the special meaning of the character following it
(for instance, C<\|> matches a vertical bar; it is not an alternation),
or it is the start of a backslash or escape sequence.

The rules determining what it is are quite simple: if the character
following the backslash is an ASCII punctuation (non-word) character (that is,
anything that is not a letter, digit, or underscore), then the backslash just
takes away any special meaning of the character following it.

If the character following the backslash is an ASCII letter or an ASCII digit,
then the sequence may be special; if so, it's listed below. A few letters have
not been used yet, so escaping them with a backslash doesn't change them to be
special.
A future version of Perl may assign a special meaning to them, so if you have warnings turned on, Perl issues a warning if you use such a sequence. [1]. It is however guaranteed that backslash or escape sequences never have a punctuation character following the backslash, not now, and not in a future version of Perl 5. So it is safe to put a backslash in front of a non-word character. Note that the backslash itself is special; if you want to match a backslash, you have to escape the backslash with a backslash: C matches a single backslash. =over 4 =item [1] There is one exception. If you use an alphanumeric character as the delimiter of your pattern (which you probably shouldn't do for readability reasons), you have to escape the delimiter if you want to match it. Perl won't warn then. See also L. =back =head2 All the sequences and escapes Those not usable within a bracketed character class (like C<[\da-z]>) are marked as C \000 Octal escape sequence. See also \o{}. \1 Absolute backreference. Not in []. \a Alarm or bell. \A Beginning of string. Not in []. \b Word/non-word boundary. (Backspace in []). \B Not a word/non-word boundary. Not in []. \cX Control-X \C Single octet, even under UTF-8. Not in []. \d Character class for digits. \D Character class for non-digits. \e Escape character. \E Turn off \Q, \L and \U processing. Not in []. \f Form feed. \F Foldcase till \E. Not in []. \g{}, \g1 Named, absolute or relative backreference. Not in [] \G Pos assertion. Not in []. \h Character class for horizontal whitespace. \H Character class for non horizontal whitespace. \k{}, \k<>, \k'' Named backreference. Not in []. \K Keep the stuff left of \K. Not in []. \l Lowercase next character. Not in []. \L Lowercase till \E. Not in []. \n (Logical) newline character. \N Any character but newline. Experimental. Not in []. \N{} Named or numbered (Unicode) character or sequence. \o{} Octal escape sequence. \p{}, \pP Character with the given Unicode property. 
\P{}, \PP Character without the given Unicode property. \Q Quote (disable) pattern metacharacters till \E. Not in []. \r Return character. \R Generic new line. Not in []. \s Character class for whitespace. \S Character class for non whitespace. \t Tab character. \u Titlecase next character. Not in []. \U Uppercase till \E. Not in []. \v Character class for vertical whitespace. \V Character class for non vertical whitespace. \w Character class for word characters. \W Character class for non-word characters. \x{}, \x00 Hexadecimal escape sequence. \X Unicode "extended grapheme cluster". Not in []. \z End of string. Not in []. \Z End of string. Not in []. =head2 Character Escapes =head3 Fixed characters A handful of characters have a dedicated I. The following table shows them, along with their ASCII code points (in decimal and hex), their ASCII name, the control escape on ASCII platforms and a short description. (For EBCDIC platforms, see L.) Seq. Code Point ASCII Cntrl Description. Dec Hex \a 7 07 BEL \cG alarm or bell \b 8 08 BS \cH backspace [1] \e 27 1B ESC \c[ escape character \f 12 0C FF \cL form feed \n 10 0A LF \cJ line feed [2] \r 13 0D CR \cM carriage return \t 9 09 TAB \cI tab =over 4 =item [1] C<\b> is the backspace character only inside a character class. Outside a character class, C<\b> is a word/non-word boundary. =item [2] C<\n> matches a logical newline. Perl converts between C<\n> and your OS's native newline character when reading from or writing to text files. =back =head4 Example $str =~ /\t/; # Matches if $str contains a (horizontal) tab. =head3 Control characters C<\c> is used to denote a control character; the character following C<\c> determines the value of the construct. For example the value of C<\cA> is C, and the value of C<\cb> is C, etc. The gory details are in L. A complete list of what C, etc. means for ASCII and EBCDIC platforms is in L. 
Note that C<\c\> alone at the end of a regular expression (or doubled-quoted string) is not valid. The backslash must be followed by another character. That is, C<\c\I> means C'> for all characters I. To write platform-independent code, you must use C<\N{I}> instead, like C<\N{ESCAPE}> or C<\N{U+001B}>, see L. Mnemonic: Iontrol character. =head4 Example $str =~ /\cK/; # Matches if $str contains a vertical tab (control-K). =head3 Named or numbered characters and character sequences Unicode characters have a Unicode name and numeric code point (ordinal) value. Use the C<\N{}> construct to specify a character by either of these values. Certain sequences of characters also have names. To specify by name, the name of the character or character sequence goes between the curly braces. To specify a character by Unicode code point, use the form C<\N{U+I}>, where I is a number in hexadecimal that gives the code point that Unicode has assigned to the desired character. It is customary but not required to use leading zeros to pad the number to 4 digits. Thus C<\N{U+0041}> means C, and you will rarely see it written without the two leading zeros. C<\N{U+0041}> means "A" even on EBCDIC machines (where the ordinal value of "A" is not 0x41). It is even possible to give your own names to characters and character sequences. For details, see L. (There is an expanded internal form that you may see in debug output: C<\N{U+I.I...}>. The C<...> means any number of these Is separated by dots. This represents the sequence formed by the characters. This is an internal form only, subject to change, and you should not try to use it yourself.) Mnemonic: Iamed character. Note that a character or character sequence expressed as a named or numbered character is considered a character without special meaning by the regex engine, and will match "as is". =head4 Example $str =~ /\N{THAI CHARACTER SO SO}/; # Matches the Thai SO SO character use charnames 'Cyrillic'; # Loads Cyrillic names. 
$str =~ /\N{ZHE}\N{KA}/; # Match "ZHE" followed by "KA". =head3 Octal escapes There are two forms of octal escapes. Each is used to specify a character by its code point specified in octal notation. One form, available starting in Perl 5.14 looks like C<\o{...}>, where the dots represent one or more octal digits. It can be used for any Unicode character. It was introduced to avoid the potential problems with the other form, available in all Perls. That form consists of a backslash followed by three octal digits. One problem with this form is that it can look exactly like an old-style backreference (see L below.) You can avoid this by making the first of the three digits always a zero, but that makes \077 the largest code point specifiable. In some contexts, a backslash followed by two or even one octal digits may be interpreted as an octal escape, sometimes with a warning, and because of some bugs, sometimes with surprising results. Also, if you are creating a regex out of smaller snippets concatenated together, and you use fewer than three digits, the beginning of one snippet may be interpreted as adding digits to the ending of the snippet before it. See L for more discussion and examples of the snippet problem. Note that a character expressed as an octal escape is considered a character without special meaning by the regex engine, and will match "as is". To summarize, the C<\o{}> form is always safe to use, and the other form is safe to use for code points through \077 when you use exactly three digits to specify them. Mnemonic: I<0>ctal or Ictal. =head4 Examples (assuming an ASCII platform) $str = "Perl"; $str =~ /\o{120}/; # Match, "\120" is "P". $str =~ /\120/; # Same. $str =~ /\o{120}+/; # Match, "\120" is "P", it's repeated at least once $str =~ /\120+/; # Same. $str =~ /P\053/; # No match, "\053" is "+" and taken literally. /\o{23073}/ # Black foreground, white background smiling face. 
/\o{4801234567}/ # Raises a warning, and yields chr(4) =head4 Disambiguation rules between old-style octal escapes and backreferences Octal escapes of the C<\000> form outside of bracketed character classes potentially clash with old-style backreferences. (see L below). They both consist of a backslash followed by numbers. So Perl has to use heuristics to determine whether it is a backreference or an octal escape. Perl uses the following rules to disambiguate: =over 4 =item 1 If the backslash is followed by a single digit, it's a backreference. =item 2 If the first digit following the backslash is a 0, it's an octal escape. =item 3 If the number following the backslash is N (in decimal), and Perl already has seen N capture groups, Perl considers this a backreference. Otherwise, it considers it an octal escape. If N has more than three digits, Perl takes only the first three for the octal escape; the rest are matched as is. my $pat = "(" x 999; $pat .= "a"; $pat .= ")" x 999; /^($pat)\1000$/; # Matches 'aa'; there are 1000 capture groups. /^$pat\1000$/; # Matches 'a@0'; there are 999 capture groups # and \1000 is seen as \100 (a '@') and a '0' =back You can force a backreference interpretation always by using the C<\g{...}> form. You can the force an octal interpretation always by using the C<\o{...}> form, or for numbers up through \077 (= 63 decimal), by using three digits, beginning with a "0". =head3 Hexadecimal escapes Like octal escapes, there are two forms of hexadecimal escapes, but both start with the same thing, C<\x>. This is followed by either exactly two hexadecimal digits forming a number, or a hexadecimal number of arbitrary length surrounded by curly braces. The hexadecimal number is the code point of the character you want to express. Note that a character expressed as one of these escapes is considered a character without special meaning by the regex engine, and will match "as is". Mnemonic: heIadecimal. 
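The two hexadecimal forms described above can be compared side by side.
The following is only a minimal sketch, assuming an ASCII platform; the
variable names are illustrative and do not come from the Perl
distribution.

```perl
use strict;
use warnings;

my $str = "Perl";

# On an ASCII platform, code point 0x50 is "P".
my $two_digit = $str =~ /\x50/   ? 1 : 0;   # exactly two hex digits
my $braced    = $str =~ /\x{50}/ ? 1 : 0;   # any number of digits in braces

# The hex digits A-F are case insensitive.
my $case_insensitive = "\x{263B}" =~ /^\x{263b}$/ ? 1 : 0;

# The escaped character has no special meaning: \x2B is a literal "+",
# not a quantifier applied to the preceding "P".
my $literal_plus = "P+" =~ /^P\x2B$/ ? 1 : 0;

print "two-digit form matches:  $two_digit\n";
print "braced form matches:     $braced\n";
print "case-insensitive digits: $case_insensitive\n";
print "literal plus matches:    $literal_plus\n";
```

The braced form is the only one that can express code points above 0xFF,
which is why the wide character in the sketch is written as C<\x{263B}>.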
=head4 Examples (assuming an ASCII platform) $str = "Perl"; $str =~ /\x50/; # Match, "\x50" is "P". $str =~ /\x50+/; # Match, "\x50" is "P", it is repeated at least once $str =~ /P\x2B/; # No match, "\x2B" is "+" and taken literally. /\x{2603}\x{2602}/ # Snowman with an umbrella. # The Unicode character 2603 is a snowman, # the Unicode character 2602 is an umbrella. /\x{263B}/ # Black smiling face. /\x{263b}/ # Same, the hex digits A - F are case insensitive. =head2 Modifiers A number of backslash sequences have to do with changing the character, or characters following them. C<\l> will lowercase the character following it, while C<\u> will uppercase (or, more accurately, titlecase) the character following it. They provide functionality similar to the functions C and C. To uppercase or lowercase several characters, one might want to use C<\L> or C<\U>, which will lowercase/uppercase all characters following them, until either the end of the pattern or the next occurrence of C<\E>, whichever comes first. They provide functionality similar to what the functions C and C provide. C<\Q> is used to quote (disable) pattern metacharacters, up to the next C<\E> or the end of the pattern. C<\Q> adds a backslash to any character that could have special meaning to Perl. In the ASCII range, it quotes every character that isn't a letter, digit, or underscore. See L for details on what gets quoted for non-ASCII code points. Using this ensures that any character between C<\Q> and C<\E> will be matched literally, not interpreted as a metacharacter by the regex engine. C<\F> can be used to casefold all characters following, up to the next C<\E> or the end of the pattern. It provides the functionality similar to the C function. Mnemonic: Iowercase, Ippercase, Iold-case, Iuotemeta, Ind. 
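The effect of C<\Q> described above can be checked with a short script.
This is a sketch only: C<matches_literally> is a hypothetical helper
written for this illustration, and the sample strings are arbitrary.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $text is exactly the string $literal,
# with every character of $literal taken literally.
sub matches_literally {
    my ($text, $literal) = @_;
    return $text =~ /\A\Q$literal\E\z/ ? 1 : 0;
}

my $literal = "a.b*c";

# Without \Q, the "." and "*" in $literal act as metacharacters,
# so the pattern also matches strings other than "a.b*c".
my $unquoted = "aXbbbc" =~ /\A$literal\z/ ? 1 : 0;

print "unquoted pattern matches 'aXbbbc': $unquoted\n";
print "quoted pattern matches 'aXbbbc':   ",
    matches_literally("aXbbbc", $literal), "\n";
print "quoted pattern matches 'a.b*c':    ",
    matches_literally("a.b*c", $literal), "\n";

# quotemeta() applies the same quoting outside of a pattern.
print "quotemeta result: ", quotemeta($literal), "\n";
```

Interpolating untrusted or user-supplied text into a pattern is the
usual reason to reach for C<\Q>, since it keeps that text from being
interpreted as regex syntax.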
=head4 Examples

 $sid     = "sid";
 $greg    = "GrEg";
 $miranda = "(Miranda)";
 $str     =~ /\u$sid/;        # Matches 'Sid'
 $str     =~ /\L$greg/;       # Matches 'greg'
 $str     =~ /\Q$miranda\E/;  # Matches '(Miranda)', as if the pattern
                              #   had been written as /\(Miranda\)/

=head2 Character classes

Perl regular expressions have a large range of character classes. Some of
the character classes are written as a backslash sequence. We will briefly
discuss those here; full details of character classes can be found in
L<perlrecharclass>.

C<\w> is a character class that matches any single I<word> character
(letters, digits, Unicode marks, and connector punctuation (like the
underscore)). C<\d> is a character class that matches any decimal
digit, while the character class C<\s> matches any whitespace character.
New in perl 5.10.0 are the classes C<\h> and C<\v> which match horizontal
and vertical whitespace characters.

The exact set of characters matched by C<\d>, C<\s>, and C<\w> varies
depending on various pragma and regular expression modifiers. It is
possible to restrict the match to the ASCII range by using the C</a>
regular expression modifier. See L<perlrecharclass>.

The uppercase variants (C<\W>, C<\D>, C<\S>, C<\H>, and C<\V>) are
character classes that match, respectively, any character that isn't a
word character, digit, whitespace, horizontal whitespace, or vertical
whitespace.

Mnemonics: I<w>ord, I<d>igit, I<s>pace, I<h>orizontal, I<v>ertical.

=head3 Unicode classes

C<\pP> (where C<P> is a single letter) and C<\p{Property}> are used to
match a character that matches the given Unicode property; properties
include things like "letter", or "thai character". Capitalizing the
sequence to C<\PP> and C<\P{Property}> makes the sequence match a character
that doesn't match the given Unicode property. For more details, see
L<perlrecharclass> and L<perlunicode>.

Mnemonic: I<p>roperty.

=head2 Referencing

If capturing parentheses are used in a regular expression, we can refer
to the part of the source string that was matched, and match exactly the
same thing. There are three ways of referring to such I<backreferences>:
absolutely, relatively, and by name.

=for later add link to perlrecapture

=head3 Absolute referencing

Either C<\gI<N>> (starting in Perl 5.10.0), or C<\I<N>> (old-style) where I<N>
is a positive (unsigned) decimal number of any length is an absolute reference
to a capturing group.

I<N> refers to the Nth set of parentheses, so C<\gI<N>> refers to whatever has
been matched by that set of parentheses. Thus C<\g1> refers to the first
capture group in the regex.

The C<\gI<N>> form can be equivalently written as C<\g{I<N>}>
which avoids ambiguity when building a regex by concatenating shorter
strings. Otherwise if you had a regex C<qr/$a$b/>, and C<$a> contained
C<"\g1">, and C<$b> contained C<"37">, you would get C</\g137/> which is
probably not what you intended.

In the C<\I<N>> form, I<N> must not begin with a "0", and there must be at
least I<N> capturing groups, or else I<N> is considered an octal escape
(but something like C<\18> is the same as C<\0018>; that is, the octal escape
C<"\001"> followed by a literal digit C<"8">).

Mnemonic: I<g>roup.

=head4 Examples

 /(\w+) \g1/;    # Finds a duplicated word, (e.g. "cat cat").
 /(\w+) \1/;     # Same thing; written old-style
 /(.)(.)\g2\g1/; # Match a four letter palindrome (e.g. "ABBA").

=head3 Relative referencing

C<\g-I<N>> (starting in Perl 5.10.0) is used for relative addressing. (It can
be written as C<\g{-I<N>}>.) It refers to the I<N>th group before the
C<\g{-I<N>}>.

The big advantage of this form is that it makes it much easier to write
patterns with references that can be interpolated in larger patterns,
even if the larger pattern also contains capture groups.

=head4 Examples

 /(A)        # Group 1
  (          # Group 2
    (B)      # Group 3
    \g{-1}   # Refers to group 3 (B)
    \g{-3}   # Refers to group 1 (A)
  )
 /x;         # Matches "ABBA".
my $qr = qr /(.)(.)\g{-2}\g{-1}/; # Matches 'abab', 'cdcd', etc. /$qr$qr/ # Matches 'ababcdcd'. =head3 Named referencing C<\g{I}> (starting in Perl 5.10.0) can be used to back refer to a named capture group, dispensing completely with having to think about capture buffer positions. To be compatible with .Net regular expressions, C<\g{name}> may also be written as C<\k{name}>, C<< \k >> or C<\k'name'>. To prevent any ambiguity, I must not start with a digit nor contain a hyphen. =head4 Examples /(?\w+) \g{word}/ # Finds duplicated word, (e.g. "cat cat") /(?\w+) \k{word}/ # Same. /(?\w+) \k/ # Same. /(?.)(?.)\g{letter2}\g{letter1}/ # Match a four letter palindrome (e.g. "ABBA") =head2 Assertions Assertions are conditions that have to be true; they don't actually match parts of the substring. There are six assertions that are written as backslash sequences. =over 4 =item \A C<\A> only matches at the beginning of the string. If the C modifier isn't used, then C is equivalent to C. However, if the C modifier is used, then C matches internal newlines, but the meaning of C isn't changed by the C modifier. C<\A> matches at the beginning of the string regardless whether the C modifier is used. =item \z, \Z C<\z> and C<\Z> match at the end of the string. If the C modifier isn't used, then C is equivalent to C; that is, it matches at the end of the string, or one before the newline at the end of the string. If the C modifier is used, then C matches at internal newlines, but the meaning of C isn't changed by the C modifier. C<\Z> matches at the end of the string (or just before a trailing newline) regardless whether the C modifier is used. C<\z> is just like C<\Z>, except that it does not match before a trailing newline. C<\z> matches at the end of the string only, regardless of the modifiers used, and not just before a newline. It is how to anchor the match to the true end of the string under all conditions. 
=item \G C<\G> is usually used only in combination with the C modifier. If the C modifier is used and the match is done in scalar context, Perl remembers where in the source string the last match ended, and the next time, it will start the match from where it ended the previous time. C<\G> matches the point where the previous match on that string ended, or the beginning of that string if there was no previous match. =for later add link to perlremodifiers Mnemonic: Ilobal. =item \b, \B C<\b> matches at any place between a word and a non-word character; C<\B> matches at any place between characters where C<\b> doesn't match. C<\b> and C<\B> assume there's a non-word character before the beginning and after the end of the source string; so C<\b> will match at the beginning (or end) of the source string if the source string begins (or ends) with a word character. Otherwise, C<\B> will match. Do not use something like C<\b=head\d\b> and expect it to match the beginning of a line. It can't, because for there to be a boundary before the non-word "=", there must be a word character immediately previous. All boundary determinations look for word characters alone, not for non-words characters nor for string ends. It may help to understand how <\b> and <\B> work by equating them as follows: \b really means (?:(?<=\w)(?!\w)|(?oundary. =back =head4 Examples "cat" =~ /\Acat/; # Match. "cat" =~ /cat\Z/; # Match. "cat\n" =~ /cat\Z/; # Match. "cat\n" =~ /cat\z/; # No match. "cat" =~ /\bcat\b/; # Matches. "cats" =~ /\bcat\b/; # No match. "cat" =~ /\bcat\B/; # No match. "cats" =~ /\bcat\B/; # Match. while ("cat dog" =~ /(\w+)/g) { print $1; # Prints 'catdog' } while ("cat dog" =~ /\G(\w+)/g) { print $1; # Prints 'cat' } =head2 Misc Here we document the backslash sequences that don't fall in one of the categories above. 
These are: =over 4 =item \C C<\C> always matches a single octet, even if the source string is encoded in UTF-8 format, and the character to be matched is a multi-octet character. C<\C> was introduced in perl 5.6. This is very dangerous, because it violates the logical character abstraction and can cause UTF-8 sequences to become malformed. Mnemonic: oItet. =item \K This appeared in perl 5.10.0. Anything matched left of C<\K> is not included in C<$&>, and will not be replaced if the pattern is used in a substitution. This lets you write C instead of C or C. Mnemonic: Ieep. =item \N This is an experimental feature new to perl 5.12.0. It matches any character that is B a newline. It is a short-hand for writing C<[^\n]>, and is identical to the C<.> metasymbol, except under the C flag, which changes the meaning of C<.>, but not C<\N>. Note that C<\N{...}> can mean a L. Mnemonic: Complement of I<\n>. =item \R X<\R> C<\R> matches a I; that is, anything considered a linebreak sequence by Unicode. This includes all characters matched by C<\v> (vertical whitespace), and the multi character sequence C<"\x0D\x0A"> (carriage return followed by a line feed, sometimes called the network newline; it's the end of line sequence used in Microsoft text files opened in binary mode). C<\R> is equivalent to C<< (?>\x0D\x0A|\v) >>. (The reason it doesn't backtrack is that the sequence is considered inseparable. That means that "\x0D\x0A" =~ /^\R\x0A$/ # No match fails, because the C<\R> matches the entire string, and won't backtrack to match just the C<"\x0D">.) Since C<\R> can match a sequence of more than one character, it cannot be put inside a bracketed character class; C is an error; use C<\v> instead. C<\R> was introduced in perl 5.10.0. Note that this does not respect any locale that might be in effect; it matches according to the platform's native character set. Mnemonic: none really. 
C<\R> was picked because PCRE already uses C<\R>,
and more importantly because Unicode recommends such a regular expression
metacharacter, and suggests C<\R> as its notation.

=item \X
X<\X>

This matches a Unicode I<extended grapheme cluster>.

C<\X> matches quite well what normal (non-Unicode-programmer) usage
would consider a single character. As an example, consider a G with some sort
of diacritic mark, such as an arrow. There is no such single character in
Unicode, but one can be composed by using a G followed by a Unicode "COMBINING
UPWARDS ARROW BELOW", and would be displayed by Unicode-aware software as if it
were a single character.

Mnemonic: eI<X>tended Unicode character.

=back

=head4 Examples

 "\x{256}" =~ /^\C\C$/;    # Match as chr (0x256) takes 2 octets in UTF-8.

 $str =~ s/foo\Kbar/baz/g; # Change any 'bar' following a 'foo' to 'baz'
 $str =~ s/(.)\K\g1//g;    # Delete duplicated characters.

 "\n"   =~ /^\R$/;         # Match, \n   is a generic newline.
 "\r"   =~ /^\R$/;         # Match, \r   is a generic newline.
 "\r\n" =~ /^\R$/;         # Match, \r\n is a generic newline.

 "P\x{307}" =~ /^\X$/      # \X matches a P with a dot above.

=cut

=head1 NAME

perldebguts - Guts of Perl debugging

=head1 DESCRIPTION

This is not L<perldebug>, which tells you how to use
the debugger. This manpage describes low-level details concerning
the debugger's internals, which range from difficult to impossible
to understand for anyone who isn't incredibly intimate with Perl's guts.
Caveat lector.

=head1 Debugger Internals

Perl has special debugging hooks at compile-time and run-time used
to create debugging environments. These hooks are not to be confused
with the I<perl -Dxxx> command described in L<perlrun>, which is
usable only if a special Perl is built per the instructions in the
F<INSTALL> podpage in the Perl source tree.

For example, whenever you call Perl's built-in C<caller> function
from the package C<DB>, the arguments that the corresponding stack
frame was called with are copied to the C<@DB::args> array.
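The C<@DB::args> behaviour can be observed without running the debugger
at all. The following is a minimal sketch, not code from F<perl5db.pl>:
the subroutine names C<caller_args> and C<greet> are hypothetical and
exist only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the arguments that the subroutine which
# called us was itself invoked with.
sub caller_args {
    package DB;   # caller() fills @DB::args only when called from package DB

    # Frame 0 is the call to caller_args(); frame 1 is the call to the
    # subroutine that invoked it. Asking for frame 1 in list context
    # copies that frame's arguments into @DB::args.
    my @frame = caller(1);
    return @DB::args;
}

sub greet {
    # Ignore @_ here and recover the arguments through the hook instead.
    return join ",", caller_args();
}

print greet("hello", 42), "\n";
```

Note that C<@DB::args> is overwritten by each such call to C<caller>, so
code that needs the values later should copy them right away, as the
C<return> in the sketch does.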
These mechanisms are enabled by calling Perl with the B<-d> switch. Specifically, the following additional features are enabled (cf. L): =over 4 =item * Perl inserts the contents of C<$ENV{PERL5DB}> (or C if not present) before the first line of your program. =item * Each array C<@{"_<$filename"}> holds the lines of $filename for a file compiled by Perl. The same is also true for Ced strings that contain subroutines, or which are currently being executed. The $filename for Ced strings looks like C<(eval 34)>. Code assertions in regexes look like C<(re_eval 19)>. Values in this array are magical in numeric context: they compare equal to zero only if the line is not breakable. =item * Each hash C<%{"_<$filename"}> contains breakpoints and actions keyed by line number. Individual entries (as opposed to the whole hash) are settable. Perl only cares about Boolean true here, although the values used by F have the form C<"$break_condition\0$action">. The same holds for evaluated strings that contain subroutines, or which are currently being executed. The $filename for Ced strings looks like C<(eval 34)> or C<(re_eval 19)>. =item * Each scalar C<${"_<$filename"}> contains C<"_<$filename">. This is also the case for evaluated strings that contain subroutines, or which are currently being executed. The $filename for Ced strings looks like C<(eval 34)> or C<(re_eval 19)>. =item * After each Cd file is compiled, but before it is executed, C is called if the subroutine C exists. Here, the $filename is the expanded name of the Cd file, as found in the values of %INC. =item * After each subroutine C is compiled, the existence of C<$DB::postponed{subname}> is checked. If this key exists, C is called if the C subroutine also exists. =item * A hash C<%DB::sub> is maintained, whose keys are subroutine names and whose values have the form C. C has the form C<(eval 34)> for subroutines defined inside Cs, or C<(re_eval 19)> for those within regex code assertions. 
=item * When the execution of your program reaches a point that can hold a breakpoint, the C subroutine is called if any of the variables C<$DB::trace>, C<$DB::single>, or C<$DB::signal> is true. These variables are not Cizable. This feature is disabled when executing inside C, including functions called from it unless C<< $^D & (1<<30) >> is true. =item * When execution of the program reaches a subroutine call, a call to C<&DB::sub>(I) is made instead, with C<$DB::sub> holding the name of the called subroutine. (This doesn't happen if the subroutine was compiled in the C package.) =back Note that if C<&DB::sub> needs external data for it to work, no subroutine call is possible without it. As an example, the standard debugger's C<&DB::sub> depends on the C<$DB::deep> variable (it defines how many levels of recursion deep into the debugger you can go before a mandatory break). If C<$DB::deep> is not defined, subroutine calls are not possible, even though C<&DB::sub> exists. =head2 Writing Your Own Debugger =head3 Environment Variables The C environment variable can be used to define a debugger. For example, the minimal "working" debugger (it actually doesn't do anything) consists of one line: sub DB::DB {} It can easily be defined like this: $ PERL5DB="sub DB::DB {}" perl -d your-script Another brief debugger, slightly more useful, can be created with only the line: sub DB::DB {print ++$i; scalar } This debugger prints a number which increments for each statement encountered and waits for you to hit a newline before continuing to the next statement. The following debugger is actually useful: { package DB; sub DB {} sub sub {print ++$i, " $sub\n"; &$sub} } It prints the sequence number of each subroutine call and the name of the called subroutine. Note that C<&DB::sub> is being compiled into the package C through the use of the C directive. When it starts, the debugger reads your rc file (F<./.perldb> or F<~/.perldb> under Unix), which can set important options. 
(A subroutine (C<&afterinit>) can be defined here as well; it is executed after the debugger completes its own initialization.) After the rc file is read, the debugger reads the PERLDB_OPTS environment variable and uses it to set debugger options. The contents of this variable are treated as if they were the argument of an C debugger command (q.v. in L). =head3 Debugger Internal Variables In addition to the file and subroutine-related variables mentioned above, the debugger also maintains various magical internal variables. =over 4 =item * C<@DB::dbline> is an alias for C<@{"::_, which holds the lines of the currently-selected file (compiled by Perl), either explicitly chosen with the debugger's C command, or implicitly by flow of execution. Values in this array are magical in numeric context: they compare equal to zero only if the line is not breakable. =item * C<%DB::dbline> is an alias for C<%{"::_, which contains breakpoints and actions keyed by line number in the currently-selected file, either explicitly chosen with the debugger's C command, or implicitly by flow of execution. As previously noted, individual entries (as opposed to the whole hash) are settable. Perl only cares about Boolean true here, although the values used by F have the form C<"$break_condition\0$action">. =back =head3 Debugger Customization Functions Some functions are provided to simplify customization. =over 4 =item * See L for a description of options parsed by C. =item * C skips the specified number of frames and returns a list containing information about the calling frames (all of them, if C is missing). Each entry is reference to a hash with keys C (either C<.>, C<$>, or C<@>), C (subroutine name, or info about C), C (C or a reference to an array), C, and C. =item * C prints formatted info about caller frames. The last two functions may be convenient as arguments to C<< < >>, C<< << >> commands. 
=back Note that any variables and functions that are not documented in this manpages (or in L) are considered for internal use only, and as such are subject to change without notice. =head1 Frame Listing Output Examples The C option can be used to control the output of frame information. For example, contrast this expression trace: $ perl -de 42 Stack dump during die enabled outside of evals. Loading DB routines from perl5db.pl patch level 0.94 Emacs support available. Enter h or 'h h' for help. main::(-e:1): 0 DB<1> sub foo { 14 } DB<2> sub bar { 3 } DB<3> t print foo() * bar() main::((eval 172):3): print foo() + bar(); main::foo((eval 168):2): main::bar((eval 170):2): 42 with this one, once the Cption C has been set: DB<4> o f=2 frame = '2' DB<5> t print foo() * bar() 3: foo() * bar() entering main::foo 2: sub foo { 14 }; exited main::foo entering main::bar 2: sub bar { 3 }; exited main::bar 42 By way of demonstration, we present below a laborious listing resulting from setting your C environment variable to the value C, and running I from the command line. Examples using various values of C are shown to give you a feel for the difference between settings. Long though it may be, this is not a complete listing, but only excerpts. =over 4 =item 1 entering main::BEGIN entering Config::BEGIN Package lib/Exporter.pm. Package lib/Carp.pm. Package lib/Config.pm. entering Config::TIEHASH entering Exporter::import entering Exporter::export entering Config::myconfig entering Config::FETCH entering Config::FETCH entering Config::FETCH entering Config::FETCH =item 2 entering main::BEGIN entering Config::BEGIN Package lib/Exporter.pm. Package lib/Carp.pm. exited Config::BEGIN Package lib/Config.pm. 
entering Config::TIEHASH exited Config::TIEHASH entering Exporter::import entering Exporter::export exited Exporter::export exited Exporter::import exited main::BEGIN entering Config::myconfig entering Config::FETCH exited Config::FETCH entering Config::FETCH exited Config::FETCH entering Config::FETCH =item 3 in $=main::BEGIN() from /dev/null:0 in $=Config::BEGIN() from lib/Config.pm:2 Package lib/Exporter.pm. Package lib/Carp.pm. Package lib/Config.pm. in $=Config::TIEHASH('Config') from lib/Config.pm:644 in $=Exporter::import('Config', 'myconfig', 'config_vars') from /dev/null:0 in $=Exporter::export('Config', 'main', 'myconfig', 'config_vars') from li in @=Config::myconfig() from /dev/null:0 in $=Config::FETCH(ref(Config), 'package') from lib/Config.pm:574 in $=Config::FETCH(ref(Config), 'baserev') from lib/Config.pm:574 in $=Config::FETCH(ref(Config), 'PERL_VERSION') from lib/Config.pm:574 in $=Config::FETCH(ref(Config), 'PERL_SUBVERSION') from lib/Config.pm:574 in $=Config::FETCH(ref(Config), 'osname') from lib/Config.pm:574 in $=Config::FETCH(ref(Config), 'osvers') from lib/Config.pm:574 =item 4 in $=main::BEGIN() from /dev/null:0 in $=Config::BEGIN() from lib/Config.pm:2 Package lib/Exporter.pm. Package lib/Carp.pm. out $=Config::BEGIN() from lib/Config.pm:0 Package lib/Config.pm. 
in $=Config::TIEHASH('Config') from lib/Config.pm:644 out $=Config::TIEHASH('Config') from lib/Config.pm:644 in $=Exporter::import('Config', 'myconfig', 'config_vars') from /dev/null:0 in $=Exporter::export('Config', 'main', 'myconfig', 'config_vars') from lib/ out $=Exporter::export('Config', 'main', 'myconfig', 'config_vars') from lib/ out $=Exporter::import('Config', 'myconfig', 'config_vars') from /dev/null:0 out $=main::BEGIN() from /dev/null:0 in @=Config::myconfig() from /dev/null:0 in $=Config::FETCH(ref(Config), 'package') from lib/Config.pm:574 out $=Config::FETCH(ref(Config), 'package') from lib/Config.pm:574 in $=Config::FETCH(ref(Config), 'baserev') from lib/Config.pm:574 out $=Config::FETCH(ref(Config), 'baserev') from lib/Config.pm:574 in $=Config::FETCH(ref(Config), 'PERL_VERSION') from lib/Config.pm:574 out $=Config::FETCH(ref(Config), 'PERL_VERSION') from lib/Config.pm:574 in $=Config::FETCH(ref(Config), 'PERL_SUBVERSION') from lib/Config.pm:574 =item 5 in $=main::BEGIN() from /dev/null:0 in $=Config::BEGIN() from lib/Config.pm:2 Package lib/Exporter.pm. Package lib/Carp.pm. out $=Config::BEGIN() from lib/Config.pm:0 Package lib/Config.pm. 
in $=Config::TIEHASH('Config') from lib/Config.pm:644 out $=Config::TIEHASH('Config') from lib/Config.pm:644 in $=Exporter::import('Config', 'myconfig', 'config_vars') from /dev/null:0 in $=Exporter::export('Config', 'main', 'myconfig', 'config_vars') from lib/E out $=Exporter::export('Config', 'main', 'myconfig', 'config_vars') from lib/E out $=Exporter::import('Config', 'myconfig', 'config_vars') from /dev/null:0 out $=main::BEGIN() from /dev/null:0 in @=Config::myconfig() from /dev/null:0 in $=Config::FETCH('Config=HASH(0x1aa444)', 'package') from lib/Config.pm:574 out $=Config::FETCH('Config=HASH(0x1aa444)', 'package') from lib/Config.pm:574 in $=Config::FETCH('Config=HASH(0x1aa444)', 'baserev') from lib/Config.pm:574 out $=Config::FETCH('Config=HASH(0x1aa444)', 'baserev') from lib/Config.pm:574 =item 6 in $=CODE(0x15eca4)() from /dev/null:0 in $=CODE(0x182528)() from lib/Config.pm:2 Package lib/Exporter.pm. out $=CODE(0x182528)() from lib/Config.pm:0 scalar context return from CODE(0x182528): undef Package lib/Config.pm. in $=Config::TIEHASH('Config') from lib/Config.pm:628 out $=Config::TIEHASH('Config') from lib/Config.pm:628 scalar context return from Config::TIEHASH: empty hash in $=Exporter::import('Config', 'myconfig', 'config_vars') from /dev/null:0 in $=Exporter::export('Config', 'main', 'myconfig', 'config_vars') from lib/Exporter.pm:171 out $=Exporter::export('Config', 'main', 'myconfig', 'config_vars') from lib/Exporter.pm:171 scalar context return from Exporter::export: '' out $=Exporter::import('Config', 'myconfig', 'config_vars') from /dev/null:0 scalar context return from Exporter::import: '' =back In all cases shown above, the line indentation shows the call tree. If bit 2 of C is set, a line is printed on exit from a subroutine as well. If bit 4 is set, the arguments are printed along with the caller info. If bit 8 is set, the arguments are printed even if they are tied or references. If bit 16 is set, the return value is printed, too. 
When a package is compiled, a line like this Package lib/Carp.pm. is printed with proper indentation. =head1 Debugging Regular Expressions There are two ways to enable debugging output for regular expressions. If your perl is compiled with C<-DDEBUGGING>, you may use the B<-Dr> flag on the command line. Otherwise, one can C, which has effects at compile time and run time. Since Perl 5.9.5, this pragma is lexically scoped. =head2 Compile-time Output The debugging output at compile time looks like this: Compiling REx '[bc]d(ef*g)+h[ij]k$' size 45 Got 364 bytes for offset annotations. first at 1 rarest char g at 0 rarest char d at 0 1: ANYOF[bc](12) 12: EXACT (14) 14: CURLYX[0] {1,32767}(28) 16: OPEN1(18) 18: EXACT (20) 20: STAR(23) 21: EXACT (0) 23: EXACT (25) 25: CLOSE1(27) 27: WHILEM[1/1](0) 28: NOTHING(29) 29: EXACT (31) 31: ANYOF[ij](42) 42: EXACT (44) 44: EOL(45) 45: END(0) anchored 'de' at 1 floating 'gh' at 3..2147483647 (checking floating) stclass 'ANYOF[bc]' minlen 7 Offsets: [45] 1[4] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 5[1] 0[0] 12[1] 0[0] 6[1] 0[0] 7[1] 0[0] 9[1] 8[1] 0[0] 10[1] 0[0] 11[1] 0[0] 12[0] 12[0] 13[1] 0[0] 14[4] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 18[1] 0[0] 19[1] 20[0] Omitting $` $& $' support. The first line shows the pre-compiled form of the regex. The second shows the size of the compiled form (in arbitrary units, usually 4-byte words) and the total number of bytes allocated for the offset/length table, usually 4+C*8. The next line shows the label I of the first node that does a match. The anchored 'de' at 1 floating 'gh' at 3..2147483647 (checking floating) stclass 'ANYOF[bc]' minlen 7 line (split into two lines above) contains optimizer information. In the example shown, the optimizer found that the match should contain a substring C at offset 1, plus substring C at some offset between 3 and infinity. 
Moreover, when checking for these substrings (to abandon impossible matches quickly), Perl will check for the substring C before checking for the substring C. The optimizer may also use the knowledge that the match starts (at the C I) with a character class, and no string shorter than 7 characters can possibly match. The fields of interest which may appear in this line are =over 4 =item C I C I =item C I C I See above. =item C Which substring to check first. =item C The minimal length of the match. =item C I Type of first matching node. =item C Don't scan for the found substrings. =item C Means that the optimizer information is all that the regular expression contains, and thus one does not need to enter the regex engine at all. =item C Set if the pattern contains C<\G>. =item C Set if the pattern starts with a repeated char (as in C). =item C Set if the pattern starts with C<.*>. =item C Set if the pattern contain eval-groups, such as C<(?{ code })> and C<(??{ code })>. =item C If the pattern may match only at a handful of places, with C being C, C, or C. See the table below. =back If a substring is known to match at end-of-line only, it may be followed by C<$>, as in C. The optimizer-specific information is used to avoid entering (a slow) regex engine on strings that will not definitely match. If the C flag is set, a call to the regex engine may be avoided even when the optimizer found an appropriate place for the match. Above the optimizer section is the list of I of the compiled form of the regex. Each line has format C< >I: I I (I) =head2 Types of Nodes Here are the possible types, with short descriptions: # TYPE arg-description [num-args] [longjump-len] DESCRIPTION # Exit points END no End of program. SUCCEED no Return from a subroutine, basically. # Anchors: BOL no Match "" at beginning of line. MBOL no Same, assuming multiline. SBOL no Same, assuming singleline. EOS no Match "" at end of string. EOL no Match "" at end of line. 
MEOL no Same, assuming multiline. SEOL no Same, assuming singleline. BOUND no Match "" at any word boundary using native charset semantics for non-utf8 BOUNDL no Match "" at any locale word boundary BOUNDU no Match "" at any word boundary using Unicode semantics BOUNDA no Match "" at any word boundary using ASCII semantics NBOUND no Match "" at any word non-boundary using native charset semantics for non-utf8 NBOUNDL no Match "" at any locale word non-boundary NBOUNDU no Match "" at any word non-boundary using Unicode semantics NBOUNDA no Match "" at any word non-boundary using ASCII semantics GPOS no Matches where last m//g left off. # [Special] alternatives: REG_ANY no Match any one character (except newline). SANY no Match any one character. CANY no Match any one byte. ANYOF sv Match character in (or not in) this class, single char match only ANYOFV sv Match character in (or not in) this class, can match-multiple chars ALNUM no Match any alphanumeric character using native charset semantics for non-utf8 ALNUML no Match any alphanumeric char in locale ALNUMU no Match any alphanumeric char using Unicode semantics ALNUMA no Match [A-Za-z_0-9] NALNUM no Match any non-alphanumeric character using native charset semantics for non-utf8 NALNUML no Match any non-alphanumeric char in locale NALNUMU no Match any non-alphanumeric char using Unicode semantics NALNUMA no Match [^A-Za-z_0-9] SPACE no Match any whitespace character using native charset semantics for non-utf8 SPACEL no Match any whitespace char in locale SPACEU no Match any whitespace char using Unicode semantics SPACEA no Match [ \t\n\f\r] NSPACE no Match any non-whitespace character using native charset semantics for non-utf8 NSPACEL no Match any non-whitespace char in locale NSPACEU no Match any non-whitespace char using Unicode semantics NSPACEA no Match [^ \t\n\f\r] DIGIT no Match any numeric character using native charset semantics for non-utf8 DIGITL no Match any numeric character in locale DIGITA no 
Match [0-9] NDIGIT no Match any non-numeric character using native charset i semantics for non-utf8 NDIGITL no Match any non-numeric character in locale NDIGITA no Match [^0-9] CLUMP no Match any extended grapheme cluster sequence # Alternation # BRANCH The set of branches constituting a single choice are hooked # together with their "next" pointers, since precedence prevents # anything being concatenated to any individual branch. The # "next" pointer of the last BRANCH in a choice points to the # thing following the whole choice. This is also where the # final "next" pointer of each individual branch points; each # branch starts with the operand node of a BRANCH node. # BRANCH node Match this alternative, or the next... # Back pointer # BACK Normal "next" pointers all implicitly point forward; BACK # exists to make loop structures possible. # not used BACK no Match "", "next" ptr points backward. # Literals EXACT str Match this string (preceded by length). EXACTF str Match this string, folded, native charset semantics for non-utf8 (prec. by length). EXACTFL str Match this string, folded in locale (w/len). EXACTFU str Match this string, folded, Unicode semantics for non-utf8 (prec. by length). EXACTFA str Match this string, folded, Unicode semantics for non-utf8, but no ASCII-range character matches outside ASCII (prec. by length),. # Do nothing types NOTHING no Match empty string. # A variant of above which delimits a group, thus stops optimizations TAIL no Match empty string. Can jump here from outside. # Loops # STAR,PLUS '?', and complex '*' and '+', are implemented as circular # BRANCH structures using BACK. Simple cases (one character # per match) are implemented with STAR and PLUS for speed # and to minimize recursive plunges. # STAR node Match this (simple) thing 0 or more times. PLUS node Match this (simple) thing 1 or more times. CURLY sv 2 Match this simple thing {n,m} times. 
CURLYN no 2 Capture next-after-this simple thing CURLYM no 2 Capture this medium-complex thing {n,m} times. CURLYX sv 2 Match this complex thing {n,m} times. # This terminator creates a loop structure for CURLYX WHILEM no Do curly processing and see if rest matches. # Buffer related # OPEN,CLOSE,GROUPP ...are numbered at compile time. OPEN num 1 Mark this point in input as start of #n. CLOSE num 1 Analogous to OPEN. REF num 1 Match some already matched string REFF num 1 Match already matched string, folded using native charset semantics for non-utf8 REFFL num 1 Match already matched string, folded in loc. REFFU num 1 Match already matched string, folded using unicode semantics for non-utf8 REFFA num 1 Match already matched string, folded using unicode semantics for non-utf8, no mixing ASCII, non-ASCII # Named references. Code in regcomp.c assumes that these all are after the # numbered references NREF no-sv 1 Match some already matched string NREFF no-sv 1 Match already matched string, folded using native charset semantics for non-utf8 NREFFL no-sv 1 Match already matched string, folded in loc. NREFFU num 1 Match already matched string, folded using unicode semantics for non-utf8 NREFFA num 1 Match already matched string, folded using unicode semantics for non-utf8, no mixing ASCII, non-ASCII IFMATCH off 1 2 Succeeds if the following matches. UNLESSM off 1 2 Fails if the following matches. SUSPEND off 1 1 "Independent" sub-RE. IFTHEN off 1 1 Switch, should be preceded by switcher. GROUPP num 1 Whether the group matched. # Support for long RE LONGJMP off 1 1 Jump far away. BRANCHJ off 1 1 BRANCH with long offset. # The heavy worker EVAL evl 1 Execute some Perl code. # Modifiers MINMOD no Next operator is not greedy. LOGICAL no Next opcode should set the flag only. # This is not used yet RENUM off 1 1 Group with independently numbered parens. # Trie Related # Behave the same as A|LIST|OF|WORDS would. 
The '..C' variants have # inline charclass data (ascii only), the 'C' store it in the structure. # NOTE: the relative order of the TRIE-like regops is significant TRIE trie 1 Match many EXACT(F[ALU]?)? at once. flags==type TRIEC charclass Same as TRIE, but with embedded charclass data # For start classes, contains an added fail table. AHOCORASICK trie 1 Aho Corasick stclass. flags==type AHOCORASICKC charclass Same as AHOCORASICK, but with embedded charclass data # Regex Subroutines GOSUB num/ofs 2L recurse to paren arg1 at (signed) ofs arg2 GOSTART no recurse to start of pattern # Special conditionals NGROUPP no-sv 1 Whether the group matched. INSUBP num 1 Whether we are in a specific recurse. DEFINEP none 1 Never execute directly. # Backtracking Verbs ENDLIKE none Used only for the type field of verbs OPFAIL none Same as (?!) ACCEPT parno 1 Accepts the current matched string. # Verbs With Arguments VERB no-sv 1 Used only for the type field of verbs PRUNE no-sv 1 Pattern fails at this startpoint if no-backtracking through this MARKPOINT no-sv 1 Push the current location for rollback by cut. SKIP no-sv 1 On failure skip forward (to the mark) before retrying COMMIT no-sv 1 Pattern fails outright if backtracking through this CUTGROUP no-sv 1 On failure go to the next alternation in the group # Control what to keep in $&. KEEPS no $& begins here. # New charclass like patterns LNBREAK none generic newline pattern VERTWS none vertical whitespace (Perl 6) NVERTWS none not vertical whitespace (Perl 6) HORIZWS none horizontal whitespace (Perl 6) NHORIZWS none not horizontal whitespace (Perl 6) FOLDCHAR codepoint 1 codepoint with tricky case folding properties. # SPECIAL REGOPS # This is not really a node, but an optimized away piece of a "long" node. # To simplify debugging output, we mark it as if it were a node OPTIMIZED off Placeholder for dump. # Special opcode with the property that no opcode in a compiled program # will ever be of this type. 
    # Thus it can be used as a flag value that
    # no other opcode has been seen. END is used similarly, in that an END
    # node can't be optimized. So END implies "unoptimizable" and PSEUDO
    # means "not seen anything to optimize yet".
    PSEUDO      off     Pseudo opcode for internal use.

=for unprinted-credits
Next section M-J. Dominus (mjd-perl-patch+@plover.com) 20010421

Following the optimizer information is a dump of the offset/length
table, here split across several lines:

    Offsets: [45]
        1[4] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 5[1]
        0[0] 12[1] 0[0] 6[1] 0[0] 7[1] 0[0] 9[1] 8[1] 0[0] 10[1] 0[0]
        11[1] 0[0] 12[0] 12[0] 13[1] 0[0] 14[4] 0[0] 0[0] 0[0] 0[0]
        0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 18[1] 0[0] 19[1] 20[0]

The first line here indicates that the offset/length table contains 45
entries. Each entry is a pair of integers, denoted by C<offset[length]>.
Entries are numbered starting with 1, so entry #1 here is C<1[4]> and
entry #12 is C<5[1]>. C<1[4]> indicates that the node labeled C<1:>
(the C<1: ANYOF[bc]>) begins at character position 1 in the
pre-compiled form of the regex, and has a length of 4 characters.
C<5[1]> in position 12 indicates that the node labeled C<12:>
(the C<< 12: EXACT <d> >>) begins at character position 5 in the
pre-compiled form of the regex, and has a length of 1 character.
C<12[1]> in position 14 indicates that the node labeled C<14:>
(the C<< 14: CURLYX[0] {1,32767} >>) begins at character position 12 in the
pre-compiled form of the regex, and has a length of 1 character---that
is, it corresponds to the C<+> symbol in the precompiled regex.

C<0[0]> items indicate that there is no corresponding node.

=head2 Run-time Output

First of all, when doing a match, one may get no run-time output even
if debugging is enabled. This means that the regex engine was never
entered and that all of the job was therefore done by the optimizer.
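As a minimal sketch of how the compile-time and run-time output discussed in this section can be produced, the C<re> pragma may be enabled for a single block. The variable names below are illustrative only, and the exact debugging text (which is written to STDERR) varies between Perl releases.

```perl
use strict;
use warnings;

my $target = 'abcdefg__gh__';
my $matched;

{
    # Debugging output is emitted only for regexes compiled in this block,
    # because the pragma is lexically scoped.
    use re 'debug';
    $matched = ( $target =~ /[bc]d(ef*g)+h[ij]k$/ ) ? 1 : 0;
}

# Outside the block, this match produces no debugging output.
my $plain = ( $target =~ /gh/ ) ? 1 : 0;

print "debugged pattern matched: $matched; plain pattern matched: $plain\n";
```

Limiting the pragma to a small block keeps the trace focused on the one pattern under study rather than on every regex in the program.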
If the regex engine was entered, the output may look like this: Matching '[bc]d(ef*g)+h[ij]k$' against 'abcdefg__gh__' Setting an EVAL scope, savestack=3 2 | 1: ANYOF 3 | 11: EXACT 4 | 13: CURLYX {1,32767} 4 | 26: WHILEM 0 out of 1..32767 cc=effff31c 4 | 15: OPEN1 4 | 17: EXACT 5 | 19: STAR EXACT can match 1 times out of 32767... Setting an EVAL scope, savestack=3 6 | 22: EXACT 7 <__gh__> | 24: CLOSE1 7 <__gh__> | 26: WHILEM 1 out of 1..32767 cc=effff31c Setting an EVAL scope, savestack=12 7 <__gh__> | 15: OPEN1 7 <__gh__> | 17: EXACT restoring \1 to 4(4)..7 failed, try continuation... 7 <__gh__> | 27: NOTHING 7 <__gh__> | 28: EXACT failed... failed... The most significant information in the output is about the particular I of the compiled regex that is currently being tested against the target string. The format of these lines is C< >I > > |I: I The I info is indented with respect to the backtracking level. Other incidental information appears interspersed within. =head1 Debugging Perl Memory Usage Perl is a profligate wastrel when it comes to memory use. There is a saying that to estimate memory usage of Perl, assume a reasonable algorithm for memory allocation, multiply that estimate by 10, and while you still may miss the mark, at least you won't be quite so astonished. This is not absolutely true, but may provide a good grasp of what happens. Assume that an integer cannot take less than 20 bytes of memory, a float cannot take less than 24 bytes, a string cannot take less than 32 bytes (all these examples assume 32-bit architectures, the result are quite a bit worse on 64-bit architectures). If a variable is accessed in two of three different ways (which require an integer, a float, or a string), the memory footprint may increase yet another 20 bytes. A sloppy malloc(3) implementation can inflate these numbers dramatically. 
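The effect of accessing one variable in more than one way can be observed, roughly, with the core B module. This is only a sketch under stated assumptions: C<sv_class> is a hypothetical helper introduced here, and the class names that B reports are an implementation detail that may differ between Perl releases.

```perl
use strict;
use warnings;
use B ();

# Hypothetical helper: report the internal class B assigns to the
# scalar that the given reference points to.
sub sv_class { return ref B::svref_2object( $_[0] ) }

my $n      = 42;
my $before = sv_class( \$n );    # so far accessed only as an integer

my $as_string = "$n";            # now also accessed as a string
my $after     = sv_class( \$n );

print "before: $before, after: $after\n";
```

If the two class names differ, the scalar was upgraded to a larger internal structure so that it could also hold the string form, which is the kind of growth described above.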
On the opposite end of the scale, a declaration like sub foo; may take up to 500 bytes of memory, depending on which release of Perl you're running. Anecdotal estimates of source-to-compiled code bloat suggest an eightfold increase. This means that the compiled form of reasonable (normally commented, properly indented etc.) code will take about eight times more space in memory than the code took on disk. The B<-DL> command-line switch is obsolete since circa Perl 5.6.0 (it was available only if Perl was built with C<-DDEBUGGING>). The switch was used to track Perl's memory allocations and possible memory leaks. These days the use of malloc debugging tools like F or F is suggested instead. See also L. One way to find out how much memory is being used by Perl data structures is to install the Devel::Size module from CPAN: it gives you the minimum number of bytes required to store a particular data structure. Please be mindful of the difference between the size() and total_size(). If Perl has been compiled using Perl's malloc you can analyze Perl memory usage by setting $ENV{PERL_DEBUG_MSTATS}. =head2 Using C<$ENV{PERL_DEBUG_MSTATS}> If your perl is using Perl's malloc() and was compiled with the necessary switches (this is the default), then it will print memory usage statistics after compiling your code when C<< $ENV{PERL_DEBUG_MSTATS} > 1 >>, and before termination of the program when C<< $ENV{PERL_DEBUG_MSTATS} >= 1 >>. The report format is similar to the following example: $ PERL_DEBUG_MSTATS=2 perl -e "require Carp" Memory allocation statistics after compilation: (buckets 4(4)..8188(8192) 14216 free: 130 117 28 7 9 0 2 2 1 0 0 437 61 36 0 5 60924 used: 125 137 161 55 7 8 6 16 2 0 1 74 109 304 84 20 Total sbrk(): 77824/21:119. Odd ends: pad+heads+chain+tail: 0+636+0+2048. 
Memory allocation statistics after execution: (buckets 4(4)..8188(8192) 30888 free: 245 78 85 13 6 2 1 3 2 0 1 315 162 39 42 11 175816 used: 265 176 1112 111 26 22 11 27 2 1 1 196 178 1066 798 39 Total sbrk(): 215040/47:145. Odd ends: pad+heads+chain+tail: 0+2192+0+6144. It is possible to ask for such a statistic at arbitrary points in your execution using the mstat() function out of the standard Devel::Peek module. Here is some explanation of that format: =over 4 =item C Perl's malloc() uses bucketed allocations. Every request is rounded up to the closest bucket size available, and a bucket is taken from the pool of buckets of that size. The line above describes the limits of buckets currently in use. Each bucket has two sizes: memory footprint and the maximal size of user data that can fit into this bucket. Suppose in the above example that the smallest bucket were size 4. The biggest bucket would have usable size 8188, and the memory footprint would be 8192. In a Perl built for debugging, some buckets may have negative usable size. This means that these buckets cannot (and will not) be used. For larger buckets, the memory footprint may be one page greater than a power of 2. If so, the corresponding power of two is printed in the C field above. =item Free/Used The 1 or 2 rows of numbers following that correspond to the number of buckets of each size between C and C. In the first row, the sizes (memory footprints) of buckets are powers of two--or possibly one page greater. In the second row, if present, the memory footprints of the buckets are between the memory footprints of two buckets "above". For example, suppose under the previous example, the memory footprints were free: 8 16 32 64 128 256 512 1024 2048 4096 8192 4 12 24 48 80 With a non-C perl, the buckets starting from C<128> have a 4-byte overhead, and thus an 8192-long bucket may take up to 8188-byte allocations. =item C The first two fields give the total amount of memory perl sbrk(2)ed (ess-broken? 
:-) and number of sbrk(2)s used. The third number is what perl thinks about
continuity of returned chunks. So long as this number is positive, malloc()
will assume that it is probable that sbrk(2) will provide continuous memory.

Memory allocated by external libraries is not counted.

=item C<pad: 0>

The amount of sbrk(2)ed memory needed to keep buckets aligned.

=item C<heads: 2192>

Although memory overhead of bigger buckets is kept inside the bucket, for
smaller buckets, it is kept in separate areas. This field gives the
total size of these areas.

=item C<chain: 0>

malloc() may want to subdivide a bigger bucket into smaller buckets.
If only a part of the deceased bucket is left unsubdivided, the rest
is kept as an element of a linked list. This field gives the total
size of these chunks.

=item C<tail: 6144>

To minimize the number of sbrk(2)s, malloc() asks for more memory. This
field gives the size of the yet unused part, which is sbrk(2)ed, but
never touched.

=back

=head1 SEE ALSO

L, L, L L, and L.

-*- buffer-read-only: t -*-

!!!!!!!   DO NOT EDIT THIS FILE   !!!!!!!
This file is built by autodoc.pl extracting documentation from the C source
files.

=head1 NAME

perlintern - autogenerated documentation of purely B<internal>
Perl functions

=head1 DESCRIPTION

This file is the autogenerated documentation of functions in the
Perl interpreter that are documented using Perl's internal documentation
format but are not marked as part of the Perl API. In other words,
B<they are not for use in extensions>!

=head1 Compile-time scope hooks

=over 8

=item BhkENTRY
X<BhkENTRY>

Return an entry from the BHK structure. I<which> is a preprocessor token
indicating which entry to return. If the appropriate flag is not set this
will return NULL. The type of the return value depends on which entry you
ask for.

NOTE: this function is experimental and may change or be removed without notice.

    void *  BhkENTRY(BHK *hk, which)

=for hackers
Found in file op.h

=item BhkFLAGS
X<BhkFLAGS>

Return the BHK's flags.
NOTE: this function is experimental and may change or be removed without notice.

    U32     BhkFLAGS(BHK *hk)

=for hackers
Found in file op.h

=item CALL_BLOCK_HOOKS
X<CALL_BLOCK_HOOKS>

Call all the registered block hooks for type I<which>. I<which> is a
preprocessing token; the type of I<arg> depends on I<which>.

NOTE: this function is experimental and may change or be removed without notice.

    void    CALL_BLOCK_HOOKS(which, arg)

=for hackers
Found in file op.h

=back

=head1 CV reference counts and CvOUTSIDE

=over 8

=item CvWEAKOUTSIDE
X<CvWEAKOUTSIDE>

Each CV has a pointer, C<CvOUTSIDE()>, to its lexically enclosing
CV (if any). Because pointers to anonymous sub prototypes are
stored in C<&> pad slots, it is possible to get a circular reference,
with the parent pointing to the child and vice-versa. To avoid the
ensuing memory leak, we do not increment the reference count of the CV
pointed to by C<CvOUTSIDE> in the I<one specific instance> that the parent
has a C<&> pad slot pointing back to us. In this case, we set the
C<CvWEAKOUTSIDE> flag in the child. This allows us to determine under what
circumstances we should decrement the refcount of the parent when freeing
the child.

There is a further complication with non-closure anonymous subs (i.e. those
that do not refer to any lexicals outside that sub). In this case, the
anonymous prototype is shared rather than being cloned. This has the
consequence that the parent may be freed while there are still active
children, e.g.

    BEGIN { $a = sub { eval '$x' } }

In this case, the BEGIN is freed immediately after execution since there
are no active references to it: the anon sub prototype has
C<CvWEAKOUTSIDE> set since it's not a closure, and $a points to the same
CV, so it doesn't contribute to BEGIN's refcount either. When $a is
executed, the C<eval '$x'> causes the chain of C<CvOUTSIDE>s to be followed,
and the freed BEGIN is accessed.

To avoid this, whenever a CV and its associated pad are freed, any
C<&> entries in the pad are explicitly removed from the pad, and if the
refcount of the pointed-to anon sub is still positive, then that
child's C<CvOUTSIDE> is set to point to its grandparent.
This will only occur in the single specific case of a non-closure anon prototype having one or more active references (such as C<$a> above). One other thing to consider is that a CV may be merely undefined rather than freed, eg C. In this case, its refcount may not have reached zero, but we still delete its pad and its C etc. Since various children may still have their C pointing at this undefined CV, we keep its own C for the time being, so that the chain of lexical scopes is unbroken. For example, the following should print 123: my $x = 123; sub tmp { sub { eval '$x' } } my $a = tmp(); undef &tmp; print $a->(); bool CvWEAKOUTSIDE(CV *cv) =for hackers Found in file cv.h =back =head1 Embedding Functions =over 8 =item cv_dump X dump the contents of a CV void cv_dump(CV *cv, const char *title) =for hackers Found in file pad.c =item do_dump_pad X Dump the contents of a padlist void do_dump_pad(I32 level, PerlIO *file, PADLIST *padlist, int full) =for hackers Found in file pad.c =item intro_my X "Introduce" my variables to visible status. U32 intro_my() =for hackers Found in file pad.c =item padlist_dup X Duplicates a pad. AV * padlist_dup(AV *srcpad, CLONE_PARAMS *param) =for hackers Found in file pad.c =item pad_alloc_name X Allocates a place in the currently-compiling pad (via L) and then stores a name for that entry. I is adopted and becomes the name entry; it must already contain the name string and be sufficiently upgraded. I and I and the C flag get added to I. None of the other processing of L is done. Returns the offset of the allocated pad slot. 
    PADOFFSET  pad_alloc_name(SV *namesv, U32 flags,
                              HV *typestash, HV *ourstash)

=for hackers
Found in file pad.c

=item pad_block_start
X<pad_block_start>

Update the pad compilation state variables on entry to a new block

    void    pad_block_start(int full)

=for hackers
Found in file pad.c

=item pad_check_dup
X<pad_check_dup>

Check for duplicate declarations: report any of:

     * a my in the current scope with the same name;
     * an our (anywhere in the pad) with the same name and the
       same stash as C<ourstash>

C<is_our> indicates that the name to check is an 'our' declaration.

    void    pad_check_dup(SV *name, U32 flags,
                          const HV *ourstash)

=for hackers
Found in file pad.c

=item pad_findlex
X<pad_findlex>

Find a named lexical anywhere in a chain of nested pads. Add fake entries
in the inner pads if it's found in an outer one.

Returns the offset in the bottom pad of the lex or the fake lex.
cv is the CV in which to start the search, and seq is the current cop_seq
to match against. If warn is true, print appropriate warnings. The out_*
vars return values, and so are pointers to where the returned values
should be stored. out_capture, if non-null, requests that the innermost
instance of the lexical is captured; out_name_sv is set to the innermost
matched namesv or fake namesv; out_flags returns the flags normally
associated with the IVX field of a fake namesv.

Note that pad_findlex() is recursive; it recurses up the chain of CVs,
then comes back down, adding fake entries as it goes. It has to be this way
because fake namesvs in anon prototypes have to store in xlow the index into
the parent pad.

    PADOFFSET  pad_findlex(const char *namepv, STRLEN namelen,
                           U32 flags, const CV* cv, U32 seq,
                           int warn, SV** out_capture,
                           SV** out_name_sv, int *out_flags)

=for hackers
Found in file pad.c

=item pad_fixup_inner_anons
X<pad_fixup_inner_anons>

For any anon CVs in the pad, change CvOUTSIDE of that CV from
old_cv to new_cv if necessary. Needed when a newly-compiled CV has to be
moved to a pre-existing CV struct.
void pad_fixup_inner_anons(PADLIST *padlist, CV *old_cv, CV *new_cv) =for hackers Found in file pad.c =item pad_free X Free the SV at offset po in the current pad. void pad_free(PADOFFSET po) =for hackers Found in file pad.c =item pad_leavemy X Cleanup at end of scope during compilation: set the max seq number for lexicals in this scope and warn of any lexicals that never got introduced. void pad_leavemy() =for hackers Found in file pad.c =item pad_push X Push a new pad frame onto the padlist, unless there's already a pad at this depth, in which case don't bother creating a new one. Then give the new pad an @_ in slot zero. void pad_push(PADLIST *padlist, int depth) =for hackers Found in file pad.c =item pad_reset X Mark all the current temporaries for reuse void pad_reset() =for hackers Found in file pad.c =item pad_swipe X Abandon the tmp in the current pad at offset po and replace with a new one. void pad_swipe(PADOFFSET po, bool refadjust) =for hackers Found in file pad.c =back =head1 Functions in file op.c =over 8 =item core_prototype X This function assigns the prototype of the named core function to C, or to a new mortal SV if C is NULL. It returns the modified C, or NULL if the core function has no prototype. C is a code as returned by C. It must be negative and unequal to -KEY_CORE. SV * core_prototype(SV *sv, const char *name, const int code, int * const opnum) =for hackers Found in file op.c =back =head1 Functions in file pp_ctl.c =over 8 =item docatch X Check for the cases 0 or 3 of cur_env.je_ret, only used inside an eval context. 0 is used as continue inside eval, 3 is used for a die caught by an inner eval - continue inner loop See cop.h: je_mustcatch, when set at any runlevel to TRUE, means eval ops must establish a local jmpenv to handle exception traps. 
OP* docatch(OP *o) =for hackers Found in file pp_ctl.c =back =head1 GV Functions =over 8 =item gv_try_downgrade X If the typeglob C can be expressed more succinctly, by having something other than a real GV in its place in the stash, replace it with the optimised form. Basic requirements for this are that C is a real typeglob, is sufficiently ordinary, and is only referenced from its package. This function is meant to be used when a GV has been looked up in part to see what was there, causing upgrading, but based on what was found it turns out that the real GV isn't required after all. If C is a completely empty typeglob, it is deleted from the stash. If C is a typeglob containing only a sufficiently-ordinary constant sub, the typeglob is replaced with a scalar-reference placeholder that more compactly represents the same thing. NOTE: this function is experimental and may change or be removed without notice. void gv_try_downgrade(GV* gv) =for hackers Found in file gv.c =back =head1 Hash Manipulation Functions =over 8 =item hv_ename_add X Adds a name to a stash's internal list of effective names. See C. This is called when a stash is assigned to a new location in the symbol table. void hv_ename_add(HV *hv, const char *name, U32 len, U32 flags) =for hackers Found in file hv.c =item hv_ename_delete X Removes a name from a stash's internal list of effective names. If this is the name returned by C, then another name in the list will take its place (C will use it). This is called when a stash is deleted from the symbol table. void hv_ename_delete(HV *hv, const char *name, U32 len, U32 flags) =for hackers Found in file hv.c =item refcounted_he_chain_2hv X Generates and returns a C representing the content of a C chain. I is currently unused and must be zero. HV * refcounted_he_chain_2hv( const struct refcounted_he *c, U32 flags ) =for hackers Found in file hv.c =item refcounted_he_fetch_pv X Like L, but takes a nul-terminated string instead of a string/length pair. 
SV * refcounted_he_fetch_pv( const struct refcounted_he *chain, const char *key, U32 hash, U32 flags ) =for hackers Found in file hv.c =item refcounted_he_fetch_pvn X Search along a C chain for an entry with the key specified by I and I. If I has the C bit set, the key octets are interpreted as UTF-8, otherwise they are interpreted as Latin-1. I is a precomputed hash of the key string, or zero if it has not been precomputed. Returns a mortal scalar representing the value associated with the key, or C<&PL_sv_placeholder> if there is no value associated with the key. SV * refcounted_he_fetch_pvn( const struct refcounted_he *chain, const char *keypv, STRLEN keylen, U32 hash, U32 flags ) =for hackers Found in file hv.c =item refcounted_he_fetch_pvs X Like L, but takes a literal string instead of a string/length pair, and no precomputed hash. SV * refcounted_he_fetch_pvs( const struct refcounted_he *chain, const char *key, U32 flags ) =for hackers Found in file hv.h =item refcounted_he_fetch_sv X Like L, but takes a Perl scalar instead of a string/length pair. SV * refcounted_he_fetch_sv( const struct refcounted_he *chain, SV *key, U32 hash, U32 flags ) =for hackers Found in file hv.c =item refcounted_he_free X Decrements the reference count of a C by one. If the reference count reaches zero the structure's memory is freed, which (recursively) causes a reduction of its parent C's reference count. It is safe to pass a null pointer to this function: no action occurs in this case. void refcounted_he_free(struct refcounted_he *he) =for hackers Found in file hv.c =item refcounted_he_inc X Increment the reference count of a C. The pointer to the C is also returned. It is safe to pass a null pointer to this function: no action occurs and a null pointer is returned. struct refcounted_he * refcounted_he_inc( struct refcounted_he *he ) =for hackers Found in file hv.c =item refcounted_he_new_pv X Like L, but takes a nul-terminated string instead of a string/length pair. 
struct refcounted_he * refcounted_he_new_pv( struct refcounted_he *parent, const char *key, U32 hash, SV *value, U32 flags ) =for hackers Found in file hv.c =item refcounted_he_new_pvn X Creates a new C. This consists of a single key/value pair and a reference to an existing C chain (which may be empty), and thus forms a longer chain. When using the longer chain, the new key/value pair takes precedence over any entry for the same key further along the chain. The new key is specified by I and I. If I has the C bit set, the key octets are interpreted as UTF-8, otherwise they are interpreted as Latin-1. I is a precomputed hash of the key string, or zero if it has not been precomputed. I is the scalar value to store for this key. I is copied by this function, which thus does not take ownership of any reference to it, and later changes to the scalar will not be reflected in the value visible in the C. Complex types of scalar will not be stored with referential integrity, but will be coerced to strings. I may be either null or C<&PL_sv_placeholder> to indicate that no value is to be associated with the key; this, as with any non-null value, takes precedence over the existence of a value for the key further along the chain. I points to the rest of the C chain to be attached to the new C. This function takes ownership of one reference to I, and returns one reference to the new C. struct refcounted_he * refcounted_he_new_pvn( struct refcounted_he *parent, const char *keypv, STRLEN keylen, U32 hash, SV *value, U32 flags ) =for hackers Found in file hv.c =item refcounted_he_new_pvs X Like L, but takes a literal string instead of a string/length pair, and no precomputed hash. struct refcounted_he * refcounted_he_new_pvs( struct refcounted_he *parent, const char *key, SV *value, U32 flags ) =for hackers Found in file hv.h =item refcounted_he_new_sv X Like L, but takes a Perl scalar instead of a string/length pair. 
struct refcounted_he * refcounted_he_new_sv( struct refcounted_he *parent, SV *key, U32 hash, SV *value, U32 flags ) =for hackers Found in file hv.c =back =head1 IO Functions =over 8 =item start_glob X Function called by C to spawn a glob (or do the glob inside perl on VMS). This code used to be inline, but now perl uses C this glob starter is only used by miniperl during the build process. Moving it away shrinks pp_hot.c; shrinking pp_hot.c helps speed perl up. NOTE: this function is experimental and may change or be removed without notice. PerlIO* start_glob(SV *tmpglob, IO *io) =for hackers Found in file doio.c =back =head1 Magical Functions =over 8 =item magic_clearhint X Triggered by a delete from %^H, records the key to C. int magic_clearhint(SV* sv, MAGIC* mg) =for hackers Found in file mg.c =item magic_clearhints X Triggered by clearing %^H, resets C. int magic_clearhints(SV* sv, MAGIC* mg) =for hackers Found in file mg.c =item magic_methcall X Invoke a magic method (like FETCH). C and C are the tied thingy and the tie magic. C is the name of the method to call. C is the number of args (in addition to $self) to pass to the method. The C can be: G_DISCARD invoke method with G_DISCARD flag and don't return a value G_UNDEF_FILL fill the stack with argc pointers to PL_sv_undef The arguments themselves are any values following the C argument. Returns the SV (if any) returned by the method, or NULL on failure. SV* magic_methcall(SV *sv, const MAGIC *mg, const char *meth, U32 flags, U32 argc, ...) =for hackers Found in file mg.c =item magic_sethint X Triggered by a store to %^H, records the key/value pair to C. It is assumed that hints aren't storing anything that would need a deep copy. Maybe we should warn if we find a reference. int magic_sethint(SV* sv, MAGIC* mg) =for hackers Found in file mg.c =item mg_localize X Copy some of the magic from an existing SV to new localized version of that SV. 
Container magic (e.g. %ENV, $1, tie) gets copied, value magic doesn't (e.g. taint, pos). If setmagic is false then no set magic will be called on the new (empty) SV. This typically means that assignment will soon follow (e.g. 'local $x = $y'), and that will handle the magic. void mg_localize(SV* sv, SV* nsv, bool setmagic) =for hackers Found in file mg.c =back =head1 MRO Functions =over 8 =item mro_get_linear_isa_dfs X Returns the Depth-First Search linearization of @ISA for the given stash. The return value is a read-only AV*. C should be 0 (it is used internally in this function's recursion). You are responsible for C on the return value if you plan to store it anywhere semi-permanently (otherwise it might be deleted out from under you the next time the cache is invalidated). AV* mro_get_linear_isa_dfs(HV* stash, U32 level) =for hackers Found in file mro.c =item mro_isa_changed_in X Takes the necessary steps (cache invalidations, mostly) when the @ISA of the given package has changed. Invoked by the C magic, should not need to invoke directly. void mro_isa_changed_in(HV* stash) =for hackers Found in file mro.c =item mro_package_moved X Call this function to signal to a stash that it has been assigned to another spot in the stash hierarchy. C is the stash that has been assigned. C is the stash it replaces, if any. C is the glob that is actually being assigned to. This can also be called with a null first argument to indicate that C has been deleted. This function invalidates isa caches on the old stash, on all subpackages nested inside it, and on the subclasses of all those, including non-existent packages that have corresponding entries in C. It also sets the effective names (C) on all the stashes as appropriate. If the C is present and is not in the symbol table, then this function simply returns. This check will be skipped if C.
void mro_package_moved(HV * const stash, HV * const oldstash, const GV * const gv, U32 flags) =for hackers Found in file mro.c =back =head1 Optree Manipulation Functions =over 8 =item finalize_optree X This function finalizes the optree. Should be called directly after the complete optree is built. It does some additional checking which can't be done in the normal ck_xxx functions and makes the tree thread-safe. void finalize_optree(OP* o) =for hackers Found in file op.c =back =head1 Pad Data Structures =over 8 =item CX_CURPAD_SAVE X Save the current pad in the given context block structure. void CX_CURPAD_SAVE(struct context) =for hackers Found in file pad.h =item CX_CURPAD_SV X Access the SV at offset po in the saved current pad in the given context block structure (can be used as an lvalue). SV * CX_CURPAD_SV(struct context, PADOFFSET po) =for hackers Found in file pad.h =item PAD_BASE_SV X Get the value from slot C in the base (DEPTH=1) pad of a padlist SV * PAD_BASE_SV(PADLIST padlist, PADOFFSET po) =for hackers Found in file pad.h =item PAD_CLONE_VARS X Clone the state variables associated with running and compiling pads. void PAD_CLONE_VARS(PerlInterpreter *proto_perl, CLONE_PARAMS* param) =for hackers Found in file pad.h =item PAD_COMPNAME_FLAGS X Return the flags for the current compiling pad name at offset C. Assumes a valid slot entry. U32 PAD_COMPNAME_FLAGS(PADOFFSET po) =for hackers Found in file pad.h =item PAD_COMPNAME_GEN X The generation number of the name at offset C in the current compiling pad (lvalue). Note that C is hijacked for this purpose. STRLEN PAD_COMPNAME_GEN(PADOFFSET po) =for hackers Found in file pad.h =item PAD_COMPNAME_GEN_set X Sets the generation number of the name at offset C in the current compiling pad (lvalue) to C. Note that C is hijacked for this purpose. STRLEN PAD_COMPNAME_GEN_set(PADOFFSET po, int gen) =for hackers Found in file pad.h =item PAD_COMPNAME_OURSTASH X Return the stash associated with an C variable.
Assumes the slot entry is a valid C lexical. HV * PAD_COMPNAME_OURSTASH(PADOFFSET po) =for hackers Found in file pad.h =item PAD_COMPNAME_PV X Return the name of the current compiling pad name at offset C. Assumes a valid slot entry. char * PAD_COMPNAME_PV(PADOFFSET po) =for hackers Found in file pad.h =item PAD_COMPNAME_TYPE X Return the type (stash) of the current compiling pad name at offset C. Must be a valid name. Returns null if not typed. HV * PAD_COMPNAME_TYPE(PADOFFSET po) =for hackers Found in file pad.h =item pad_peg X When PERL_MAD is enabled, this is a small no-op function that gets called at the start of each pad-related function. It can be breakpointed to track all pad operations. The parameter is a string indicating the type of pad operation being performed. NOTE: this function is experimental and may change or be removed without notice. void pad_peg(const char *s) =for hackers Found in file pad.c =item PAD_RESTORE_LOCAL X Restore the old pad saved into the local variable opad by PAD_SAVE_LOCAL() void PAD_RESTORE_LOCAL(PAD *opad) =for hackers Found in file pad.h =item PAD_SAVE_LOCAL X Save the current pad to the local variable opad, then make the current pad equal to npad void PAD_SAVE_LOCAL(PAD *opad, PAD *npad) =for hackers Found in file pad.h =item PAD_SAVE_SETNULLPAD X Save the current pad then set it to null. void PAD_SAVE_SETNULLPAD() =for hackers Found in file pad.h =item PAD_SETSV X Set the slot at offset C in the current pad to C SV * PAD_SETSV(PADOFFSET po, SV* sv) =for hackers Found in file pad.h =item PAD_SET_CUR X Set the current pad to be pad C in the padlist, saving the previous current pad. 
NB currently this macro expands to a string too long for some compilers, so it's best to replace it with SAVECOMPPAD(); PAD_SET_CUR_NOSAVE(padlist,n); void PAD_SET_CUR(PADLIST padlist, I32 n) =for hackers Found in file pad.h =item PAD_SET_CUR_NOSAVE X like PAD_SET_CUR, but without the save void PAD_SET_CUR_NOSAVE(PADLIST padlist, I32 n) =for hackers Found in file pad.h =item PAD_SV X Get the value at offset C in the current pad void PAD_SV(PADOFFSET po) =for hackers Found in file pad.h =item PAD_SVl X Lightweight and lvalue version of C. Get or set the value at offset C in the current pad. Unlike C, does not print diagnostics with -DX. For internal use only. SV * PAD_SVl(PADOFFSET po) =for hackers Found in file pad.h =item SAVECLEARSV X Clear the pointed to pad value on scope exit. (i.e. the runtime action of 'my') void SAVECLEARSV(SV **svp) =for hackers Found in file pad.h =item SAVECOMPPAD X save PL_comppad and PL_curpad void SAVECOMPPAD() =for hackers Found in file pad.h =item SAVEPADSV X Save a pad slot (used to restore after an iteration) XXX DAPM it would make more sense to make the arg a PADOFFSET void SAVEPADSV(PADOFFSET po) =for hackers Found in file pad.h =back =head1 Per-Interpreter Variables =over 8 =item PL_DBsingle X When Perl is run in debugging mode, with the B<-d> switch, this SV is a boolean which indicates whether subs are being single-stepped. Single-stepping is automatically turned on after every step. This is the C variable which corresponds to Perl's $DB::single variable. See C. SV * PL_DBsingle =for hackers Found in file intrpvar.h =item PL_DBsub X When Perl is run in debugging mode, with the B<-d> switch, this GV contains the SV which holds the name of the sub being debugged. This is the C variable which corresponds to Perl's $DB::sub variable. See C. GV * PL_DBsub =for hackers Found in file intrpvar.h =item PL_DBtrace X Trace variable used when Perl is run in debugging mode, with the B<-d> switch. 
This is the C variable which corresponds to Perl's $DB::trace variable. See C. SV * PL_DBtrace =for hackers Found in file intrpvar.h =item PL_dowarn X The C variable which corresponds to Perl's $^W warning variable. bool PL_dowarn =for hackers Found in file intrpvar.h =item PL_last_in_gv X The GV which was last used for a filehandle input operation. (C<< >>) GV* PL_last_in_gv =for hackers Found in file intrpvar.h =item PL_ofsgv X The glob containing the output field separator - C<*,> in Perl space. GV* PL_ofsgv =for hackers Found in file intrpvar.h =item PL_rs X The input record separator - C<$/> in Perl space. SV* PL_rs =for hackers Found in file intrpvar.h =back =head1 Stack Manipulation Macros =over 8 =item djSP X Declare Just C. This is actually identical to C, and declares a local copy of perl's stack pointer, available via the C macro. See C. (Available for backward source code compatibility with the old (Perl 5.005) thread model.) djSP; =for hackers Found in file pp.h =item LVRET X True if this op will be the return value of an lvalue subroutine =for hackers Found in file pp.h =back =head1 SV Manipulation Functions =over 8 =item sv_add_arena X Given a chunk of memory, link it to the head of the list of arenas, and split it into a list of free SVs. void sv_add_arena(char *const ptr, const U32 size, const U32 flags) =for hackers Found in file sv.c =item sv_clean_all X Decrement the refcnt of each remaining SV, possibly triggering a cleanup. This function may have to be called multiple times to free SVs which are in complex self-referential hierarchies. I32 sv_clean_all() =for hackers Found in file sv.c =item sv_clean_objs X Attempt to destroy all objects not yet freed. void sv_clean_objs() =for hackers Found in file sv.c =item sv_free_arenas X Deallocate the memory used by all arenas. Note that all the individual SV heads and bodies within the arenas must already have been freed. 
void sv_free_arenas() =for hackers Found in file sv.c =back =head1 SV-Body Allocation =over 8 =item sv_2num X Return an SV with the numeric value of the source SV, doing any necessary reference or overload conversion. You must use the C macro to access this function. NOTE: this function is experimental and may change or be removed without notice. SV* sv_2num(SV *const sv) =for hackers Found in file sv.c =item sv_ref X Returns a SV describing what the SV passed in is a reference to. SV* sv_ref(SV *dst, const SV *const sv, const int ob) =for hackers Found in file sv.c =back =head1 Unicode Support =over 8 =item find_uninit_var X Find the name of the undefined variable (if any) that caused the operator to issue a "Use of uninitialized value" warning. If match is true, only return a name if its value matches uninit_sv. So roughly speaking, if a unary operator (such as OP_COS) generates a warning, then following the direct child of the op may yield an OP_PADSV or OP_GV that gives the name of the undefined variable. On the other hand, with OP_ADD there are two branches to follow, so we only print the variable name if we get an exact match. The name is returned as a mortal SV. Assumes that PL_op is the op that originally triggered the error, and that PL_comppad/PL_curpad points to the currently executing pad. NOTE: this function is experimental and may change or be removed without notice. SV* find_uninit_var(const OP *const obase, const SV *const uninit_sv, bool top) =for hackers Found in file sv.c =item report_uninit X Print appropriate "Use of uninitialized variable" warning. void report_uninit(const SV *uninit_sv) =for hackers Found in file sv.c =back =head1 Undocumented functions The following functions are currently undocumented. If you use one of them, you may wish to consider creating and submitting documentation for it. 
=over =item _add_range_to_invlist X<_add_range_to_invlist> =item _core_swash_init X<_core_swash_init> =item _invlist_array_init X<_invlist_array_init> =item _invlist_contents X<_invlist_contents> =item _invlist_intersection X<_invlist_intersection> =item _invlist_intersection_maybe_complement_2nd X<_invlist_intersection_maybe_complement_2nd> =item _invlist_invert X<_invlist_invert> =item _invlist_invert_prop X<_invlist_invert_prop> =item _invlist_populate_swatch X<_invlist_populate_swatch> =item _invlist_subtract X<_invlist_subtract> =item _invlist_union X<_invlist_union> =item _invlist_union_maybe_complement_2nd X<_invlist_union_maybe_complement_2nd> =item _is_utf8__perl_idstart X<_is_utf8__perl_idstart> =item _new_invlist X<_new_invlist> =item _swash_inversion_hash X<_swash_inversion_hash> =item _swash_to_invlist X<_swash_to_invlist> =item _to_fold_latin1 X<_to_fold_latin1> =item _to_upper_title_latin1 X<_to_upper_title_latin1> =item aassign_common_vars X =item add_cp_to_invlist X =item addmad X =item allocmy X =item amagic_is_enabled X =item append_madprops X =item apply X =item av_reify X =item bind_match X =item block_end X =item block_start X =item boot_core_PerlIO X =item boot_core_UNIVERSAL X =item boot_core_mro X =item cando X =item check_utf8_print X =item ck_entersub_args_core X =item convert X =item coresub_op X =item create_eval_scope X =item cv_ckproto_len_flags X =item cvgv_set X =item cvstash_set X =item deb_stack_all X =item delete_eval_scope X =item die_unwind X =item do_aexec X =item do_aexec5 X =item do_eof X =item do_exec X =item do_exec3 X =item do_execfree X =item do_ipcctl X =item do_ipcget X =item do_msgrcv X =item do_msgsnd X =item do_ncmp X =item do_op_xmldump X =item do_pmop_xmldump X =item do_print X =item do_readline X =item do_seek X =item do_semop X =item do_shmio X =item do_sysseek X =item do_tell X =item do_trans X =item do_vecget X =item do_vecset X =item do_vop X =item dofile X =item dump_all_perl X =item dump_packsubs_perl X 
=item dump_sub_perl X =item dump_sv_child X =item emulate_cop_io X =item feature_is_enabled X =item find_rundefsv2 X =item find_script X =item free_tied_hv_pool X =item get_db_sub X =item get_debug_opts X =item get_hash_seed X =item get_invlist_iter_addr X =item get_invlist_len_addr X =item get_invlist_version_id_addr X =item get_invlist_zero_addr X =item get_no_modify X =item get_opargs X =item get_re_arg X =item getenv_len X =item hfree_next_entry X =item hv_backreferences_p X =item hv_kill_backrefs X =item hv_undef_flags X =item init_argv_symbols X =item init_dbargs X =item init_debugger X =item invert X =item invlist_array X =item invlist_clone X =item invlist_iterinit X =item invlist_len X =item invlist_max X =item invlist_set_len X =item invlist_trim X =item io_close X =item is_utf8_X_L X =item is_utf8_X_LV X =item is_utf8_X_LVT X =item is_utf8_X_LV_LVT_V X =item is_utf8_X_T X =item is_utf8_X_V X =item is_utf8_X_begin X =item is_utf8_X_extend X =item is_utf8_X_non_hangul X =item is_utf8_X_prepend X =item jmaybe X =item keyword X =item keyword_plugin_standard X =item list X =item localize X =item mad_free X =item madlex X =item madparse X =item magic_clear_all_env X =item magic_clearenv X =item magic_clearisa X =item magic_clearpack X =item magic_clearsig X =item magic_existspack X =item magic_freearylen_p X =item magic_freeovrld X =item magic_get X =item magic_getarylen X =item magic_getdefelem X =item magic_getnkeys X =item magic_getpack X =item magic_getpos X =item magic_getsig X =item magic_getsubstr X =item magic_gettaint X =item magic_getuvar X =item magic_getvec X =item magic_killbackrefs X =item magic_len X =item magic_nextpack X =item magic_regdata_cnt X =item magic_regdatum_get X =item magic_regdatum_set X =item magic_scalarpack X =item magic_set X =item magic_set_all_env X =item magic_setamagic X =item magic_setarylen X =item magic_setcollxfrm X =item magic_setdbline X =item magic_setdefelem X =item magic_setenv X =item magic_setisa X =item 
magic_setmglob X =item magic_setnkeys X =item magic_setpack X =item magic_setpos X =item magic_setregexp X =item magic_setsig X =item magic_setsubstr X =item magic_settaint X =item magic_setutf8 X =item magic_setuvar X =item magic_setvec X =item magic_setvstring X =item magic_sizepack X =item magic_wipepack X =item malloc_good_size X =item malloced_size X =item mem_collxfrm X =item mode_from_discipline X =item more_bodies X =item mro_meta_dup X =item mro_meta_init X =item munge_qwlist_to_paren_list X =item my_attrs X =item my_betoh16 X =item my_betoh32 X =item my_betoh64 X =item my_betohi X =item my_betohl X =item my_betohs X =item my_clearenv X =item my_htobe16 X =item my_htobe32 X =item my_htobe64 X =item my_htobei X =item my_htobel X =item my_htobes X =item my_htole16 X =item my_htole32 X =item my_htole64 X =item my_htolei X =item my_htolel X =item my_htoles X =item my_letoh16 X =item my_letoh32 X =item my_letoh64 X =item my_letohi X =item my_letohl X =item my_letohs X =item my_lstat_flags X =item my_stat_flags X =item my_swabn X =item my_unexec X =item newATTRSUB_flags X =item newGP X =item newMADPROP X =item newMADsv X =item newTOKEN X =item newXS_len_flags X =item new_warnings_bitfield X =item nextargv X =item oopsAV X =item oopsHV X =item op_clear X =item op_const_sv X =item op_getmad X =item op_getmad_weak X =item op_integerize X =item op_lvalue_flags X =item op_refcnt_dec X =item op_refcnt_inc X =item op_std_init X =item op_xmldump X =item package X =item package_version X =item parse_unicode_opts X =item parser_free X =item peep X =item pending_Slabs_to_ro X =item pmop_xmldump X =item pmruntime X =item populate_isa X =item prepend_madprops X =item qerror X =item reg_named_buff X =item reg_named_buff_iter X =item reg_numbered_buff_fetch X =item reg_numbered_buff_length X =item reg_numbered_buff_store X =item reg_qr_package X =item reg_temp_copy X =item regcurly X =item regprop X =item report_evil_fh X =item report_redefined_cv X =item report_wrongway_fh X 
=item rpeep X =item rsignal_restore X =item rsignal_save X =item rxres_save X =item same_dirent X =item sawparens X =item scalar X =item scalarvoid X =item set_regclass_bit X =item sighandler X =item softref2xv X =item sub_crush_depth X =item sv_add_backref X =item sv_catxmlpv X =item sv_catxmlpvn X =item sv_catxmlsv X =item sv_compile_2op_is_broken X =item sv_del_backref X =item sv_free2 X =item sv_kill_backrefs X =item sv_sethek X =item sv_setsv_cow X =item sv_unglob X =item sv_xmlpeek X =item tied_method X =item token_free X =item token_getmad X =item translate_substr_offsets X =item try_amagic_bin X =item try_amagic_un X =item unshare_hek X =item utilize X =item varname X =item vivify_defelem X =item vivify_ref X =item wait4pid X =item was_lvalue_sub X =item watch X =item write_to_stderr X =item xmldump_all X =item xmldump_all_perl X =item xmldump_eval X =item xmldump_form X =item xmldump_indent X =item xmldump_packsubs X =item xmldump_packsubs_perl X =item xmldump_sub X =item xmldump_sub_perl X =item xmldump_vindent X =item xs_apiversion_bootcheck X =item xs_version_bootcheck X =item yyerror X =item yyerror_pv X =item yyerror_pvn X =item yylex X =item yyparse X =item yyunlex X =back =head1 AUTHORS The autodocumentation system was originally added to the Perl core by Benjamin Stuhl. Documentation is by whoever was kind enough to document their functions. =head1 SEE ALSO L, L =cut =encoding utf8 =head1 NAME perlbot - This document has been deleted =head1 DESCRIPTION For information on OO programming with Perl, please see L and L. =cut =head1 NAME perlvar - Perl predefined variables =head1 DESCRIPTION =head2 The Syntax of Variable Names Variable names in Perl can have several formats.
Usually, they must begin with a letter or underscore, in which case they can be arbitrarily long (up to an internal limit of 251 characters) and may contain letters, digits, underscores, or the special sequence C<::> or C<'>. In this case, the part before the last C<::> or C<'> is taken to be a I<package qualifier>; see L<perlmod>.

Perl variable names may also be a sequence of digits or a single punctuation or control character. These names are all reserved for special uses by Perl; for example, the all-digits names are used to hold data captured by backreferences after a regular expression match. Perl has a special syntax for the single-control-character names: it understands C<^X> (caret C<X>) to mean the control-C<X> character. For example, the notation C<$^W> (dollar-sign caret C<W>) is the scalar variable whose name is the single character control-C<W>. This is better than typing a literal control-C<W> into your program.

Since Perl 5.6, Perl variable names may be alphanumeric strings that begin with control characters (or better yet, a caret). These variables must be written in the form C<${^Foo}>; the braces are not optional. C<${^Foo}> denotes the scalar variable whose name is a control-C<F> followed by two C<o>'s. These variables are reserved for future special uses by Perl, except for the ones that begin with C<^_> (control-underscore or caret-underscore). No control-character name that begins with C<^_> will acquire a special meaning in any future version of Perl; such names may therefore be used safely in programs. C<$^_> itself, however, I<is> reserved.

Perl identifiers that begin with digits, control characters, or punctuation characters are exempt from the effects of the C<package> declaration and are always forced to be in package C<main>; they are also exempt from C<strict 'vars'> errors. A few other names are also exempt in these ways:

    ENV      STDIN
    INC      STDOUT
    ARGV     STDERR
    ARGVOUT
    SIG

In particular, the special C<${^_XYZ}> variables are always taken to be in package C<main>, regardless of any C<package> declarations presently in scope.

=head1 SPECIAL VARIABLES

The following names have special meaning to Perl. Most punctuation names have reasonable mnemonics, or analogs in the shells. Nevertheless, if you wish to use long variable names, you need only say:

    use English;

at the top of your program. This aliases all the short names to the long names in the current package. Some even have medium names, generally borrowed from B<awk>. To avoid a performance hit, if you don't need C<$PREMATCH>, C<$MATCH>, or C<$POSTMATCH>, it's best to use the C<English> module without them:

    use English '-no_match_vars';

Before you continue, note the sort order for variables. In general, we first list the variables in case-insensitive, almost-lexicographical order (ignoring the C<{> or C<^> preceding words, as in C<${^UNICODE}> or C<$^T>), although C<$_> and C<@_> move up to the top of the pile. For variables with the same identifier, we list them in order of scalar, array, hash, and bareword.

=head2 General Variables

=over 8

=item $ARG

=item $_
X<$_> X<$ARG>

The default input and pattern-searching space. The following pairs are equivalent:

    while (<>) {...}    # equivalent only in while!
    while (defined($_ = <>)) {...}

    /^Subject:/
    $_ =~ /^Subject:/

    tr/a-z/A-Z/
    $_ =~ tr/a-z/A-Z/

    chomp
    chomp($_)

Here are the places where Perl will assume C<$_> even if you don't use it:

=over 3

=item *

The following functions use C<$_> as a default argument:

abs, alarm, chomp, chop, chr, chroot, cos, defined, eval, evalbytes, exp, glob, hex, int, lc, lcfirst, length, log, lstat, mkdir, oct, ord, pos, print, quotemeta, readlink, readpipe, ref, require, reverse (in scalar context only), rmdir, sin, split (on its second argument), sqrt, stat, study, uc, ucfirst, unlink, unpack.

=item *

All file tests (C<-f>, C<-d>) except for C<-t>, which defaults to STDIN. See L<perlfunc/-X>.

=item *

The pattern matching operations C<m//>, C<s///> and C<tr///> (aka C<y///>) when used without an C<=~> operator.

=item *

The default iterator variable in a C<foreach> loop if no other variable is supplied.

=item *

The implicit iterator variable in the C<grep()> and C<map()> functions.

=item *

The implicit variable of C<given()>.

=item *

The default place to put an input record when a C<< <FH> >> operation's result is tested by itself as the sole criterion of a C<while> test. Outside a C<while> test, this will not happen.

=back

As C<$_> is a global variable, this may lead in some cases to unwanted side-effects. As of perl 5.10, you can now use a lexical version of C<$_> by declaring it in a file or in a block with C<my>. Moreover, declaring C<our $_> restores the global C<$_> in the current scope.

Mnemonic: underline is understood in certain operations.

=item @ARG

=item @_
X<@_> X<@ARG>

Within a subroutine the array C<@_> contains the parameters passed to that subroutine. Inside a subroutine, C<@_> is the default array for the array operators C<push>, C<pop>, C<shift>, and C<unshift>.

See L<perlsub>.

=item $LIST_SEPARATOR

=item $"
X<$"> X<$LIST_SEPARATOR>

When an array or an array slice is interpolated into a double-quoted string or a similar context such as C</.../>, its elements are separated by this value. Default is a space. For example, this:

    print "The array is: @array\n";

is equivalent to this:

    print "The array is: " . join($", @array) . "\n";

Mnemonic: works in double-quoted context.

=item $PROCESS_ID

=item $PID

=item $$
X<$$> X<$PID> X<$PROCESS_ID>

The process number of the Perl running this script. Though you I<can> set this variable, doing so is generally discouraged, although it can be invaluable for some testing purposes. It will be reset automatically across C<fork()> calls.

Note for Linux and Debian GNU/kFreeBSD users: Before Perl v5.16.0 perl would emulate POSIX semantics on Linux systems using LinuxThreads, a partial implementation of POSIX Threads that has since been superseded by the Native POSIX Thread Library (NPTL).
LinuxThreads is now obsolete on Linux, and caching C<getpid()> like this made embedding perl unnecessarily complex (since you'd have to manually update the value of $$), so now C<$$> and C<getppid()> will always return the same values as the underlying C library.

Debian GNU/kFreeBSD systems also used LinuxThreads up until and including the 6.0 release, but after that moved to FreeBSD thread semantics, which are POSIX-like.

To see if your system is affected by this discrepancy, check if C<getconf GNU_LIBPTHREAD_VERSION | grep -q NPTL> returns a false value. NPTL threads preserve the POSIX semantics.

Mnemonic: same as shells.

=item $PROGRAM_NAME

=item $0
X<$0> X<$PROGRAM_NAME>

Contains the name of the program being executed.

On some (but not all) operating systems assigning to C<$0> modifies the argument area that the C<ps> program sees. On some platforms you may have to use special C<ps> options or a different C<ps> to see the changes. Modifying C<$0> is more useful as a way of indicating the current program state than it is for hiding the program you're running.

Note that there are platform-specific limitations on the maximum length of C<$0>. In the most extreme case it may be limited to the space occupied by the original C<$0>.

On some platforms there may be an arbitrary amount of padding, for example space characters, after the modified name as shown by C<ps>. On some platforms this padding may extend all the way to the original length of the argument area, no matter what you do (this is the case for example with Linux 2.2).

Note for BSD users: setting C<$0> does not completely remove "perl" from the ps(1) output. For example, setting C<$0> to C<"foobar"> may result in C<"perl: foobar (perl)"> (whether both the C<"perl: "> prefix and the C<" (perl)"> suffix are shown depends on your exact BSD variant and version). This is an operating system feature; Perl cannot help it.
In multithreaded scripts Perl coordinates the threads so that any thread may modify its copy of the C<$0> and the change becomes visible to ps(1) (assuming the operating system plays along). Note that the view of C<$0> the other threads have will not change since they have their own copies of it. If the program has been given to perl via the switches C<-e> or C<-E>, C<$0> will contain the string C<"-e">. On Linux as of perl 5.14 the legacy process name will be set with C, in addition to altering the POSIX name via C as perl has done since version 4.000. Now system utilities that read the legacy process name such as ps, top and killall will recognize the name you set when assigning to C<$0>. The string you supply will be cut off at 16 bytes, this is a limitation imposed by Linux. Mnemonic: same as B and B. =item $REAL_GROUP_ID =item $GID =item $( X<$(> X<$GID> X<$REAL_GROUP_ID> The real gid of this process. If you are on a machine that supports membership in multiple groups simultaneously, gives a space separated list of groups you are in. The first number is the one returned by C, and the subsequent ones by C, one of which may be the same as the first number. However, a value assigned to C<$(> must be a single number used to set the real gid. So the value given by C<$(> should I be assigned back to C<$(> without being forced numeric, such as by adding zero. Note that this is different to the effective gid (C<$)>) which does take a list. You can change both the real gid and the effective gid at the same time by using C. Changes to C<$(> require a check to C<$!> to detect any possible errors after an attempted change. Mnemonic: parentheses are used to I things. The real gid is the group you I, if you're running setgid. =item $EFFECTIVE_GROUP_ID =item $EGID =item $) X<$)> X<$EGID> X<$EFFECTIVE_GROUP_ID> The effective gid of this process. 
If you are on a machine that supports membership in multiple groups simultaneously, gives a space separated list of groups you are in. The first number is the one returned by C, and the subsequent ones by C, one of which may be the same as the first number. Similarly, a value assigned to C<$)> must also be a space-separated list of numbers. The first number sets the effective gid, and the rest (if any) are passed to C. To get the effect of an empty list for C, just repeat the new effective gid; that is, to force an effective gid of 5 and an effectively empty C list, say C< $) = "5 5" >. You can change both the effective gid and the real gid at the same time by using C (use only a single numeric argument). Changes to C<$)> require a check to C<$!> to detect any possible errors after an attempted change. C<< $< >>, C<< $> >>, C<$(> and C<$)> can be set only on machines that support the corresponding I routine. C<$(> and C<$)> can be swapped only on machines supporting C. Mnemonic: parentheses are used to I things. The effective gid is the group that's I for you, if you're running setgid. =item $REAL_USER_ID =item $UID =item $< X<< $< >> X<$UID> X<$REAL_USER_ID> The real uid of this process. You can change both the real uid and the effective uid at the same time by using C. Since changes to C<< $< >> require a system call, check C<$!> after a change attempt to detect any possible errors. Mnemonic: it's the uid you came I, if you're running setuid. =item $EFFECTIVE_USER_ID =item $EUID =item $> X<< $> >> X<$EUID> X<$EFFECTIVE_USER_ID> The effective uid of this process. For example: $< = $>; # set real to effective uid ($<,$>) = ($>,$<); # swap real and effective uids You can change both the effective uid and the real uid at the same time by using C. Changes to C<< $> >> require a check to C<$!> to detect any possible errors after an attempted change. C<< $< >> and C<< $> >> can be swapped only on machines supporting C. 
Mnemonic: it's the uid you went I, if you're running setuid. =item $SUBSCRIPT_SEPARATOR =item $SUBSEP =item $; X<$;> X<$SUBSEP> X The subscript separator for multidimensional array emulation. If you refer to a hash element as $foo{$a,$b,$c} it really means $foo{join($;, $a, $b, $c)} But don't put @foo{$a,$b,$c} # a slice--note the @ which means ($foo{$a},$foo{$b},$foo{$c}) Default is "\034", the same as SUBSEP in B. If your keys contain binary data there might not be any safe value for C<$;>. Consider using "real" multidimensional arrays as described in L. Mnemonic: comma (the syntactic subscript separator) is a semi-semicolon. =item $a =item $b X<$a> X<$b> Special package variables when using C, see L. Because of this specialness C<$a> and C<$b> don't need to be declared (using C, or C) even when using the C pragma. Don't lexicalize them with C or C if you want to be able to use them in the C comparison block or function. =item %ENV X<%ENV> The hash C<%ENV> contains your current environment. Setting a value in C changes the environment for any child processes you subsequently C off. =item $SYSTEM_FD_MAX =item $^F X<$^F> X<$SYSTEM_FD_MAX> The maximum system file descriptor, ordinarily 2. System file descriptors are passed to Ced processes, while higher file descriptors are not. Also, during an C, system file descriptors are preserved even if the C fails (ordinary file descriptors are closed before the C is attempted). The close-on-exec status of a file descriptor will be decided according to the value of C<$^F> when the corresponding file, pipe, or socket was opened, not the time of the C. =item @F X<@F> The array C<@F> contains the fields of each line read in when autosplit mode is turned on. See L for the B<-a> switch. This array is package-specific, and must be declared or given a full package name if not in package main when running under C. 
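As a rough sketch of what autosplit mode does, the loop below performs by hand the split into C<@F> that the B<-a> switch performs implicitly for each input line. The sample data and the names C<$input> and C<@last_fields> are made up for illustration, and an in-memory filehandle stands in for the files that C<< <> >> would normally read.

```perl
use strict;
use warnings;

# Hypothetical sample input; under -a each line would come from <>.
my $input = "alice 30 london\nbob 25 paris\n";
open my $fh, '<', \$input or die "cannot open in-memory file: $!";

our @F;             # @F is a package variable; declaring it is safe under strict
my @last_fields;
while (<$fh>) {
    @F = split ' ';                 # what -a does implicitly for each line
    push @last_fields, $F[-1];      # keep the last field of every line
}
close $fh;

print "@last_fields\n";
```

With the sample data shown, this prints C<london paris>.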
=item @INC X<@INC> The array C<@INC> contains the list of places that the C, C, or C constructs look for their library files. It initially consists of the arguments to any B<-I> command-line switches, followed by the default Perl library, probably F, followed by ".", to represent the current directory. ("." will not be appended if taint checks are enabled, either by C<-T> or by C<-t>.) If you need to modify this at runtime, you should use the C pragma to get the machine-dependent library properly loaded also: use lib '/mypath/libdir/'; use SomeMod; You can also insert hooks into the file inclusion system by putting Perl code directly into C<@INC>. Those hooks may be subroutine references, array references or blessed objects. See L for details. =item %INC X<%INC> The hash C<%INC> contains entries for each filename included via the C, C, or C operators. The key is the filename you specified (with module names converted to pathnames), and the value is the location of the file found. The C operator uses this hash to determine whether a particular file has already been included. If the file was loaded via a hook (e.g. a subroutine reference, see L for a description of these hooks), this hook is by default inserted into C<%INC> in place of a filename. Note, however, that the hook may have set the C<%INC> entry by itself to provide some more specific info. =item $INPLACE_EDIT =item $^I X<$^I> X<$INPLACE_EDIT> The current value of the inplace-edit extension. Use C to disable inplace editing. Mnemonic: value of B<-i> switch. =item $^M X<$^M> By default, running out of memory is an untrappable, fatal error. However, if suitably built, Perl can use the contents of C<$^M> as an emergency memory pool after Cing. Suppose that your Perl were compiled with C<-DPERL_EMERGENCY_SBRK> and used Perl's malloc. Then $^M = 'a' x (1 << 16); would allocate a 64K buffer for use in an emergency. 
See the F file in the Perl distribution for information on how to add custom C compilation flags when compiling perl. To discourage casual use of this advanced feature, there is no L long name for this variable. This variable was added in Perl 5.004. =item $OSNAME =item $^O X<$^O> X<$OSNAME> The name of the operating system under which this copy of Perl was built, as determined during the configuration process. For examples see L. The value is identical to C<$Config{'osname'}>. See also L and the B<-V> command-line switch documented in L. In Windows platforms, C<$^O> is not very helpful: since it is always C, it doesn't tell the difference between 95/98/ME/NT/2000/XP/CE/.NET. Use C or Win32::GetOSVersion() (see L and L) to distinguish between the variants. This variable was added in Perl 5.003. =item %SIG X<%SIG> The hash C<%SIG> contains signal handlers for signals. For example: sub handler { # 1st argument is signal name my($sig) = @_; print "Caught a SIG$sig--shutting down\n"; close(LOG); exit(0); } $SIG{'INT'} = \&handler; $SIG{'QUIT'} = \&handler; ... $SIG{'INT'} = 'DEFAULT'; # restore default action $SIG{'QUIT'} = 'IGNORE'; # ignore SIGQUIT Using a value of C<'IGNORE'> usually has the effect of ignoring the signal, except for the C signal. See L for more about this special case. Here are some other examples: $SIG{"PIPE"} = "Plumber"; # assumes main::Plumber (not # recommended) $SIG{"PIPE"} = \&Plumber; # just fine; assume current # Plumber $SIG{"PIPE"} = *Plumber; # somewhat esoteric $SIG{"PIPE"} = Plumber(); # oops, what did Plumber() # return?? Be sure not to use a bareword as the name of a signal handler, lest you inadvertently call it. If your system has the C function then signal handlers are installed using it. This means you get reliable signal handling. The default delivery policy of signals changed in Perl 5.8.0 from immediate (also known as "unsafe") to deferred, also known as "safe signals". See L for more information. 
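The handler assignments above can be tried out with a small self-contained script. This is only a sketch: it assumes a POSIX-like system that provides C<SIGUSR1>, and the variable name C<$caught> is invented for the example. The process sends the signal to itself, so no second process is needed.

```perl
use strict;
use warnings;

my $caught = '';

# Install a handler by code reference; the signal name arrives in $_[0].
$SIG{USR1} = sub { $caught = $_[0] };

# Send SIGUSR1 to this very process (assumes a POSIX-like system).
kill 'USR1', $$;

# With deferred ("safe") signals the handler runs at the next safe point,
# which has passed by the time this statement executes.
print "caught SIG$caught\n";

$SIG{USR1} = 'DEFAULT';    # restore the default action
```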
Certain internal hooks can be also set using the C<%SIG> hash. The routine indicated by C<$SIG{__WARN__}> is called when a warning message is about to be printed. The warning message is passed as the first argument. The presence of a C<__WARN__> hook causes the ordinary printing of warnings to C to be suppressed. You can use this to save warnings in a variable, or turn warnings into fatal errors, like this: local $SIG{__WARN__} = sub { die $_[0] }; eval $proggie; As the C<'IGNORE'> hook is not supported by C<__WARN__>, you can disable warnings using the empty subroutine: local $SIG{__WARN__} = sub {}; The routine indicated by C<$SIG{__DIE__}> is called when a fatal exception is about to be thrown. The error message is passed as the first argument. When a C<__DIE__> hook routine returns, the exception processing continues as it would have in the absence of the hook, unless the hook routine itself exits via a C, a loop exit, or a C. The C<__DIE__> handler is explicitly disabled during the call, so that you can die from a C<__DIE__> handler. Similarly for C<__WARN__>. Due to an implementation glitch, the C<$SIG{__DIE__}> hook is called even inside an C. Do not use this to rewrite a pending exception in C<$@>, or as a bizarre substitute for overriding C. This strange action at a distance may be fixed in a future release so that C<$SIG{__DIE__}> is only called if your program is about to exit, as was the original intent. Any other use is deprecated. C<__DIE__>/C<__WARN__> handlers are very special in one respect: they may be called to report (probable) errors found by the parser. In such a case the parser may be in inconsistent state, so any attempt to evaluate Perl code from such a handler will probably result in a segfault. 
This means that warnings or errors that result from parsing Perl should be used with extreme caution, like this: require Carp if defined $^S; Carp::confess("Something wrong") if defined &Carp::confess; die "Something wrong, but could not load Carp to give " . "backtrace...\n\t" . "To see backtrace try starting Perl with -MCarp switch"; Here the first line will load C I it is the parser who called the handler. The second line will print backtrace and die if C was available. The third line will be executed only if C was not available. Having to even think about the C<$^S> variable in your exception handlers is simply wrong. C<$SIG{__DIE__}> as currently implemented invites grievous and difficult to track down errors. Avoid it and use an C or CORE::GLOBAL::die override instead. See L, L, L, and L for additional information. =item $BASETIME =item $^T X<$^T> X<$BASETIME> The time at which the program began running, in seconds since the epoch (beginning of 1970). The values returned by the B<-M>, B<-A>, and B<-C> filetests are based on this value. =item $PERL_VERSION =item $^V X<$^V> X<$PERL_VERSION> The revision, version, and subversion of the Perl interpreter, represented as a C object. This variable first appeared in perl 5.6.0; earlier versions of perl will see an undefined value. Before perl 5.10.0 C<$^V> was represented as a v-string. C<$^V> can be used to determine whether the Perl interpreter executing a script is in the right range of versions. For example: warn "Hashes not randomized!\n" if !$^V or $^V lt v5.8.1 To convert C<$^V> into its string representation use C's C<"%vd"> conversion: printf "version is v%vd\n", $^V; # Perl's version See the documentation of C and C for a convenient way to fail if the running Perl interpreter is too old. See also C<$]> for an older representation of the Perl version. This variable was added in Perl 5.6. Mnemonic: use ^V for Version Control. 
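A minimal sketch of both uses of C<$^V> described above: formatting it with C<"%vd"> and comparing it against a v-string literal. The variable names are invented for the example, and the 5.10.0 threshold is chosen only for illustration.

```perl
use strict;
use warnings;

# Render $^V as a dotted string such as "v5.36.0".
my $version_string = sprintf "v%vd", $^V;

# Compare against a v-string literal, in the same style as the text above.
my $is_5_10_or_later = ($^V ge v5.10.0) ? 1 : 0;

print "running $version_string; 5.10 or later: $is_5_10_or_later\n";
```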
=item ${^WIN32_SLOPPY_STAT} X<${^WIN32_SLOPPY_STAT}> X X If this variable is set to a true value, then C on Windows will not try to open the file. This means that the link count cannot be determined and file attributes may be out of date if additional hardlinks to the file exist. On the other hand, not opening the file is considerably faster, especially for files on network drives. This variable could be set in the F file to configure the local Perl installation to use "sloppy" C by default. See the documentation for B<-f> in L for more information about site customization. This variable was added in Perl 5.10. =item $EXECUTABLE_NAME =item $^X X<$^X> X<$EXECUTABLE_NAME> The name used to execute the current copy of Perl, from C's C or (where supported) F. Depending on the host operating system, the value of C<$^X> may be a relative or absolute pathname of the perl program file, or may be the string used to invoke perl but not the pathname of the perl program file. Also, most operating systems permit invoking programs that are not in the PATH environment variable, so there is no guarantee that the value of C<$^X> is in PATH. For VMS, the value may or may not include a version number. You usually can use the value of C<$^X> to re-invoke an independent copy of the same perl that is currently running, e.g., @first_run = `$^X -le "print int rand 100 for 1..100"`; But recall that not all operating systems support forking or capturing of the output of commands, so this complex statement may not be portable. It is not safe to use the value of C<$^X> as a path name of a file, as some operating systems that have a mandatory suffix on executable files do not require use of the suffix when invoking a command. To convert the value of C<$^X> to a path name, use the following statements: # Build up a set of file names (not command names). 
use Config; my $this_perl = $^X; if ($^O ne 'VMS') { $this_perl .= $Config{_exe} unless $this_perl =~ m/$Config{_exe}$/i; } Because many operating systems permit anyone with read access to the Perl program file to make a copy of it, patch the copy, and then execute the copy, the security-conscious Perl programmer should take care to invoke the installed copy of perl, not the copy referenced by C<$^X>. The following statements accomplish this goal, and produce a pathname that can be invoked as a command or referenced as a file. use Config; my $secure_perl_path = $Config{perlpath}; if ($^O ne 'VMS') { $secure_perl_path .= $Config{_exe} unless $secure_perl_path =~ m/$Config{_exe}$/i; } =back =head2 Variables related to regular expressions Most of the special variables related to regular expressions are side effects. Perl sets these variables when it has a successful match, so you should check the match result before using them. For instance: if( /P(A)TT(ER)N/ ) { print "I found $1 and $2\n"; } These variables are read-only and dynamically-scoped, unless we note otherwise. The dynamic nature of the regular expression variables means that their value is limited to the block that they are in, as demonstrated by this bit of code: my $outer = 'Wallace and Grommit'; my $inner = 'Mutt and Jeff'; my $pattern = qr/(\S+) and (\S+)/; sub show_n { print "\$1 is $1; \$2 is $2\n" } { OUTER: show_n() if $outer =~ m/$pattern/; INNER: { show_n() if $inner =~ m/$pattern/; } show_n(); } The output shows that while in the C block, the values of C<$1> and C<$2> are from the match against C<$outer>. Inside the C block, the values of C<$1> and C<$2> are from the match against C<$inner>, but only until the end of the block (i.e. the dynamic scope). 
After the C<INNER> block completes, the values of C<$1> and C<$2> return to the values for the match against C<$outer> even though we have not made another match:

    $1 is Wallace; $2 is Grommit
    $1 is Mutt; $2 is Jeff
    $1 is Wallace; $2 is Grommit

Due to an unfortunate accident of Perl's implementation, C<use English> imposes a considerable performance penalty on all regular expression matches in a program because it uses the C<$`>, C<$&>, and C<$'> variables, regardless of whether they occur in the scope of C<use English>. For that reason, saying C<use English> in libraries is strongly discouraged unless you import it without the match variables:

    use English '-no_match_vars'

The C<Devel::NYTProf> and C<Devel::FindAmpersand> modules can help you find uses of these problematic match variables in your code.

Since Perl 5.10, you can use the C</p>
match operator flag and the C<${^PREMATCH}>, C<${^MATCH}>, and C<${^POSTMATCH}> variables instead, so that only the patterns using the flag incur the performance penalty.

=over 8

=item $<I<digits>> ($1, $2, ...)
X<$1> X<$2> X<$3>

Contains the subpattern from the corresponding set of capturing parentheses from the last successful pattern match, not counting patterns matched in nested blocks that have been exited already.

These variables are read-only and dynamically-scoped.

Mnemonic: like \digits.

=item $MATCH

=item $&
X<$&> X<$MATCH>

The string matched by the last successful pattern match (not counting any matches hidden within a BLOCK or C<eval> enclosed by the current BLOCK).

The use of this variable anywhere in a program imposes a considerable performance penalty on all regular expression matches. To avoid this penalty, you can extract the same substring by using L</@->. Starting with Perl 5.10, you can use the C</p>
match flag and the C<${^MATCH}> variable to do the same thing for particular match operations.

This variable is read-only and dynamically-scoped.

Mnemonic: like C<&> in some editors.

=item ${^MATCH}
X<${^MATCH}>

This is similar to C<$&> (C<$MATCH>) except that it does not incur the performance penalty associated with that variable, and is only guaranteed to return a defined value when the pattern was compiled or executed with the C</p>
modifier.

This variable was added in Perl 5.10.

This variable is read-only and dynamically-scoped.

=item $PREMATCH

=item $`
X<$`> X<$PREMATCH> X<${^PREMATCH}>

The string preceding whatever was matched by the last successful pattern match, not counting any matches hidden within a BLOCK or C<eval> enclosed by the current BLOCK.

The use of this variable anywhere in a program imposes a considerable performance penalty on all regular expression matches. To avoid this penalty, you can extract the same substring by using L</@->. Starting with Perl 5.10, you can use the C</p>
match flag and the C<${^PREMATCH}> variable to do the same thing for particular match operations.

This variable is read-only and dynamically-scoped.

Mnemonic: C<`> often precedes a quoted string.

=item ${^PREMATCH}
X<$`> X<${^PREMATCH}>

This is similar to C<$`> (C<$PREMATCH>) except that it does not incur the performance penalty associated with that variable, and is only guaranteed to return a defined value when the pattern was compiled or executed with the C</p>
modifier.

This variable was added in Perl 5.10.

This variable is read-only and dynamically-scoped.

=item $POSTMATCH

=item $'
X<$'> X<$POSTMATCH> X<${^POSTMATCH}> X<@->

The string following whatever was matched by the last successful pattern match (not counting any matches hidden within a BLOCK or C<eval> enclosed by the current BLOCK). Example:

    local $_ = 'abcdefghi';
    /def/;
    print "$`:$&:$'\n";       # prints abc:def:ghi

The use of this variable anywhere in a program imposes a considerable performance penalty on all regular expression matches. To avoid this penalty, you can extract the same substring by using L</@->. Starting with Perl 5.10, you can use the C</p>
match flag and the C<${^POSTMATCH}> variable to do the same thing for particular match operations.

This variable is read-only and dynamically-scoped.

Mnemonic: C<'> often follows a quoted string.

=item ${^POSTMATCH}
X<${^POSTMATCH}> X<$'> X<$POSTMATCH>

This is similar to C<$'> (C<$POSTMATCH>) except that it does not incur the performance penalty associated with that variable, and is only guaranteed to return a defined value when the pattern was compiled or executed with the C</p>
modifier.

This variable was added in Perl 5.10.

This variable is read-only and dynamically-scoped.

=item $LAST_PAREN_MATCH

=item $+
X<$+> X<$LAST_PAREN_MATCH>

The text matched by the last bracket of the last successful search pattern. This is useful if you don't know which one of a set of alternative patterns matched. For example:

    /Version: (.*)|Revision: (.*)/ && ($rev = $+);

This variable is read-only and dynamically-scoped.

Mnemonic: be positive and forward looking.

=item $LAST_SUBMATCH_RESULT

=item $^N
X<$^N> X<$LAST_SUBMATCH_RESULT>

The text matched by the used group most-recently closed (i.e. the group with the rightmost closing parenthesis) of the last successful search pattern.

This is primarily used inside C<(?{...})> blocks for examining text recently matched. For example, to effectively capture text to a variable (in addition to C<$1>, C<$2>, etc.), replace C<(...)> with

    (?:(...)(?{ $var = $^N }))

Setting and then using C<$var> in this way relieves you from having to worry about exactly which numbered set of parentheses it is.

This variable was added in Perl 5.8.

Mnemonic: the (possibly) Nested parenthesis that most recently closed.

=item @LAST_MATCH_END

=item @+
X<@+> X<@LAST_MATCH_END>

This array holds the offsets of the ends of the last successful submatches in the currently active dynamic scope. C<$+[0]> is the offset into the string of the end of the entire match. This is the same value as what the C<pos> function returns when called on the variable that was matched against. The I<n>th element of this array holds the offset of the I<n>th submatch, so C<$+[1]> is the offset past where C<$1> ends, C<$+[2]> the offset past where C<$2> ends, and so on. You can use C<$#+> to determine how many subgroups were in the last successful match. See the examples given for the C<@-> variable.

This variable was added in Perl 5.6.
=item %LAST_PAREN_MATCH

=item %+
X<%+> X<%LAST_PAREN_MATCH>

Similar to C<@+>, the C<%+> hash allows access to the named capture buffers, should they exist, in the last successful match in the currently active dynamic scope.

For example, C<$+{foo}> is equivalent to C<$1> after the following match:

    'foo' =~ /(?<foo>foo)/;

The keys of the C<%+> hash list only the names of buffers that have captured (and that are thus associated to defined values).

The underlying behaviour of C<%+> is provided by the L<Tie::Hash::NamedCapture> module.

B<Note:> C<%-> and C<%+> are tied views into a common internal hash associated with the last successful regular expression. Therefore mixing iterative access to them via C<each> may have unpredictable results. Likewise, if the last successful match changes, then the results may be surprising.

This variable was added in Perl 5.10.

This variable is read-only and dynamically-scoped.

=item @LAST_MATCH_START

=item @-
X<@-> X<@LAST_MATCH_START>

C<$-[0]> is the offset of the start of the last successful match. C<$-[>I<n>C<]> is the offset of the start of the substring matched by I<n>-th subpattern, or undef if the subpattern did not match.

Thus, after a match against C<$_>, C<$&> coincides with C<substr $_, $-[0], $+[0] - $-[0]>. Similarly, $I<n> coincides with C<substr $_, $-[n], $+[n] - $-[n]> if C<$-[n]> is defined, and $+ coincides with C<substr $_, $-[$#-], $+[$#-] - $-[$#-]>. One can use C<$#-> to find the last matched subgroup in the last successful match. Contrast with C<$#+>, the number of subgroups in the regular expression. Compare with C<@+>.

This array holds the offsets of the beginnings of the last successful submatches in the currently active dynamic scope. C<$-[0]> is the offset into the string of the beginning of the entire match. The I<n>th element of this array holds the offset of the I<n>th submatch, so C<$-[1]> is the offset where C<$1> begins, C<$-[2]> the offset where C<$2> begins, and so on.
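The offset arithmetic can be checked with a short script. This is an illustrative sketch only; the sample string and the variable names are invented. It rebuilds the matched text, the text around it, and one capture group purely from C<@-> and C<@+>, without touching C<$&>, C<$`>, or C<$'>.

```perl
use strict;
use warnings;

my $var = "The quick brown fox";
my ($before, $match, $after, $second);

if ($var =~ /(quick) (brown)/) {
    # Whole match, rebuilt from its start and end offsets.
    $match  = substr($var, $-[0], $+[0] - $-[0]);
    # Text before and after the match.
    $before = substr($var, 0, $-[0]);
    $after  = substr($var, $+[0]);
    # Second capture group, rebuilt the same way.
    $second = substr($var, $-[2], $+[2] - $-[2]);
}

print "[$before][$match][$after] second=$second\n";
```

For this input the script prints C<[The ][quick brown][ fox] second=brown>.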
After a match against some variable C<$var>:

=over 5

=item C<$`> is the same as C<substr($var, 0, $-[0])>

=item C<$&> is the same as C<substr($var, $-[0], $+[0] - $-[0])>

=item C<$'> is the same as C<substr($var, $+[0])>

=item C<$1> is the same as C<substr($var, $-[1], $+[1] - $-[1])>

=item C<$2> is the same as C<substr($var, $-[2], $+[2] - $-[2])>

=item C<$3> is the same as C<substr($var, $-[3], $+[3] - $-[3])>

=back

This variable was added in Perl 5.6.

=item %LAST_MATCH_START

=item %-
X<%-> X<%LAST_MATCH_START>

Similar to C<%+>, this variable allows access to the named capture groups in the last successful match in the currently active dynamic scope. To each capture group name found in the regular expression, it associates a reference to an array containing the list of values captured by all buffers with that name (should there be several of them), in the order in which they appear.

Here's an example:

    if ('1234' =~ /(?<A>1)(?<B>2)(?<A>3)(?<B>4)/) {
        foreach my $bufname (sort keys %-) {
            my $ary = $-{$bufname};
            foreach my $idx (0..$#$ary) {
                print "\$-{$bufname}[$idx] : ",
                      (defined($ary->[$idx])
                          ? "'$ary->[$idx]'"
                          : "undef"),
                      "\n";
            }
        }
    }

would print out:

    $-{A}[0] : '1'
    $-{A}[1] : '3'
    $-{B}[0] : '2'
    $-{B}[1] : '4'

The keys of the C<%-> hash correspond to all buffer names found in the regular expression.

The behaviour of C<%-> is implemented via the L<Tie::Hash::NamedCapture> module.

B<Note:> C<%-> and C<%+> are tied views into a common internal hash associated with the last successful regular expression. Therefore mixing iterative access to them via C<each> may have unpredictable results. Likewise, if the last successful match changes, then the results may be surprising.

This variable was added in Perl 5.10.

This variable is read-only and dynamically-scoped.

=item $LAST_REGEXP_CODE_RESULT

=item $^R
X<$^R> X<$LAST_REGEXP_CODE_RESULT>

The result of evaluation of the last successful C<(?{ code })> regular expression assertion (see L<perlre>). May be written to.

This variable was added in Perl 5.005.

=item ${^RE_DEBUG_FLAGS}
X<${^RE_DEBUG_FLAGS}>

The current value of the regex debugging flags. Set to 0 for no debug output even when the C<re 'debug'> module is loaded. See L<re> for details.

This variable was added in Perl 5.10.
=item ${^RE_TRIE_MAXBUF} X<${^RE_TRIE_MAXBUF}> Controls how certain regex optimisations are applied and how much memory they utilize. This value by default is 65536 which corresponds to a 512kB temporary cache. Set this to a higher value to trade memory for speed when matching large alternations. Set it to a lower value if you want the optimisations to be as conservative of memory as possible but still occur, and set it to a negative value to prevent the optimisation and conserve the most memory. Under normal situations this variable should be of no interest to you. This variable was added in Perl 5.10. =back =head2 Variables related to filehandles Variables that depend on the currently selected filehandle may be set by calling an appropriate object method on the C object, although this is less efficient than using the regular built-in variables. (Summary lines below for this contain the word HANDLE.) First you must say use IO::Handle; after which you may use either method HANDLE EXPR or more safely, HANDLE->method(EXPR) Each method returns the old value of the C attribute. The methods each take an optional EXPR, which, if supplied, specifies the new value for the C attribute in question. If not supplied, most methods do nothing to the current value--except for C, which will assume a 1 for you, just to be different. Because loading in the C class is an expensive operation, you should learn how to use the regular built-in variables. A few of these variables are considered "read-only". This means that if you try to assign to this variable, either directly or indirectly through a reference, you'll raise a run-time exception. You should be very careful when modifying the default values of most special variables described in this document. In most cases you want to localize these variables before changing them, since if you don't, the change may affect other modules which rely on the default values of the special variables that you have changed. 
This is one of the correct ways to read the whole file at once:

    open my $fh, "<", "foo" or die $!;
    local $/; # enable localized slurp mode
    my $content = <$fh>;
    close $fh;

But the following code is quite bad:

    open my $fh, "<", "foo" or die $!;
    undef $/; # enable slurp mode
    my $content = <$fh>;
    close $fh;

because some other module may want to read data from some file in the default "line mode". Once the code above has executed, the global value of C<$/> is changed for any other code running inside the same Perl interpreter.

Usually when a variable is localized you want to make sure that this change affects the shortest scope possible. So unless you are already inside some short C<{}> block, you should create one yourself. For example:

    my $content = '';
    open my $fh, "<", "foo" or die $!;
    {
        local $/;
        $content = <$fh>;
    }
    close $fh;

Here is an example of how your own code can break:

    for ( 1..3 ){
        $\ = "\r\n";
        nasty_break();
        print "$_";
    }

    sub nasty_break {
        $\ = "\f";
        # do something with $_
    }

You probably expect this code to print the equivalent of

    "1\r\n2\r\n3\r\n"

but instead you get:

    "1\f2\f3\f"

Why? Because C<nasty_break()> modifies C<$\> without localizing it first. The value you set in C<nasty_break()> is still there when you return. The fix is to add C<local()> so the value doesn't leak out of C<nasty_break()>:

    local $\ = "\f";

It's easy to notice the problem in such a short example, but in more complicated code you are looking for trouble if you don't localize changes to the special variables.

=over 8

=item $ARGV
X<$ARGV>

Contains the name of the current file when reading from C<< <> >>.

=item @ARGV
X<@ARGV>

The array C<@ARGV> contains the command-line arguments intended for the script. C<$#ARGV> is generally the number of arguments minus one, because C<$ARGV[0]> is the first argument, I<not> the program's command name itself. See L</$0> for the command name.

=item ARGV
X<ARGV>

The special filehandle that iterates over command-line filenames in C<@ARGV>.
Usually written as the null filehandle in the angle operator C<< <> >>. Note that currently C<ARGV> only has its magical effect within the C<< <> >> operator; elsewhere it is just a plain filehandle corresponding to the last file opened by C<< <> >>. In particular, passing C<\*ARGV> as a parameter to a function that expects a filehandle may not cause your function to automatically read the contents of all the files in C<@ARGV>.

=item ARGVOUT
X<ARGVOUT>

The special filehandle that points to the currently open output file when doing edit-in-place processing with B<-i>. Useful when you have to do a lot of inserting and don't want to keep modifying C<$_>. See L<perlrun> for the B<-i> switch.

=item Handle->output_field_separator( EXPR )

=item $OUTPUT_FIELD_SEPARATOR

=item $OFS

=item $,
X<$,> X<$OFS> X<$OUTPUT_FIELD_SEPARATOR>

The output field separator for the print operator. If defined, this value is printed between each of print's arguments. Default is C<undef>.

Mnemonic: what is printed when there is a "," in your print statement.

=item HANDLE->input_line_number( EXPR )

=item $INPUT_LINE_NUMBER

=item $NR

=item $.
X<$.> X<$NR> X<$INPUT_LINE_NUMBER> X<line number>

Current line number for the last filehandle accessed.

Each filehandle in Perl counts the number of lines that have been read from it. (Depending on the value of C<$/>, Perl's idea of what constitutes a line may not match yours.) When a line is read from a filehandle (via C<readline()> or C<< <> >>), or when C<tell()> or C<seek()> is called on it, C<$.> becomes an alias to the line counter for that filehandle.

You can adjust the counter by assigning to C<$.>, but this will not actually move the seek pointer. I<Localizing C<$.> will not localize the filehandle's line count>. Instead, it will localize perl's notion of which filehandle C<$.> is currently aliased to.

C<$.> is reset when the filehandle is closed, but B<not> when an open filehandle is reopened without an intervening C<close()>. For more details, see L<perlop/"IE<sol>O Operators">.
Because C<< <> >> never does an explicit close, line numbers increase across C<ARGV> files (but see examples in L<perlfunc/eof>).

You can also use C<< HANDLE->input_line_number(EXPR) >> to access the line counter for a given filehandle without having to worry about which handle you last accessed.

Mnemonic: many programs use "." to mean the current line number.

=item HANDLE->input_record_separator( EXPR )

=item $INPUT_RECORD_SEPARATOR

=item $RS

=item $/
X<$/> X<$RS> X<$INPUT_RECORD_SEPARATOR>

The input record separator, newline by default. This influences Perl's idea of what a "line" is. Works like B<awk>'s RS variable, including treating empty lines as a terminator if set to the null string (an empty line cannot contain any spaces or tabs). You may set it to a multi-character string to match a multi-character terminator, or to C<undef> to read through the end of file. Setting it to C<"\n\n"> means something slightly different than setting to C<"">, if the file contains consecutive empty lines. Setting to C<""> will treat two or more consecutive empty lines as a single empty line. Setting to C<"\n\n"> will blindly assume that the next input character belongs to the next paragraph, even if it's a newline.

    local $/;           # enable "slurp" mode
    local $_ = <FH>;    # whole file now here
    s/\n[ \t]+/ /g;

Remember: the value of C<$/> is a string, not a regex. B<awk> has to be better for something. :-)

Setting C<$/> to a reference to an integer, scalar containing an integer, or scalar that's convertible to an integer will attempt to read records instead of lines, with the maximum record size being the referenced integer. So this:

    local $/ = \32768; # or \"32768", or \$var_containing_32768
    open my $fh, "<", $myfile or die $!;
    local $_ = <$fh>;

will read a record of no more than 32768 bytes from C<$fh>. If you're not reading from a record-oriented file (or your OS doesn't have record-oriented files), then you'll likely get a full chunk of data with every read.
If a record is larger than the record size you've set, you'll get the record back in pieces. Trying to set the record size to zero or less will cause reading in the (rest of the) whole file.

On VMS only, record reads bypass PerlIO layers and any associated buffering, so you must not mix record and non-record reads on the same filehandle. Record mode mixes with line mode only when the same buffering layer is in use for both modes.

If you perform a record read on a FILE with an encoding layer such as C<:encoding(latin1)> or C<:utf8>, you may get an invalid string as a result, may leave the FILE positioned between characters in the stream and may not be reading the number of bytes from the underlying file that you specified. This behaviour may change without warning in a future version of perl.

See also L<perlport/"Newlines">. Also see L</$.>.

Mnemonic: / delimits line boundaries when quoting poetry.

=item Handle->output_record_separator( EXPR )

=item $OUTPUT_RECORD_SEPARATOR

=item $ORS

=item $\
X<$\> X<$ORS> X<$OUTPUT_RECORD_SEPARATOR>

The output record separator for the print operator. If defined, this value is printed after the last of print's arguments. Default is C<undef>.

Mnemonic: you set C<$\> instead of adding "\n" at the end of the print. Also, it's just like C<$/>, but it's what you get "back" from Perl.

=item HANDLE->autoflush( EXPR )

=item $OUTPUT_AUTOFLUSH

=item $|
X<$|> X<autoflush> X<flush> X<$OUTPUT_AUTOFLUSH>

If set to nonzero, forces a flush right away and after every write or print on the currently selected output channel. Default is 0 (regardless of whether the channel is really buffered by the system or not; C<$|> tells you only whether you've asked Perl explicitly to flush after each write). STDOUT will typically be line buffered if output is to the terminal and block buffered otherwise. Setting this variable is useful primarily when you are outputting to a pipe or socket, such as when you are running a Perl program under B<rsh> and want to see the output as it's happening.
This has no effect on input buffering. See L<perlfunc/getc> for that. See L<perlfunc/select> on how to select the output channel. See also L<IO::Handle>.

Mnemonic: when you want your pipes to be piping hot.

=back

=head3 Variables related to formats

The special variables for formats are a subset of those for filehandles. See L<perlform> for more information about Perl's formats.

=over 8

=item $ACCUMULATOR

=item $^A
X<$^A> X<$ACCUMULATOR>

The current value of the C<write()> accumulator for C<format()> lines. A format contains C<formline()> calls that put their result into C<$^A>. After calling its format, C<write()> prints out the contents of C<$^A> and empties it. So you never really see the contents of C<$^A> unless you call C<formline()> yourself and then look at it. See L<perlform> and L<perlfunc/"formline PICTURE,LIST">.

=item HANDLE->format_formfeed(EXPR)

=item $FORMAT_FORMFEED

=item $^L
X<$^L> X<$FORMAT_FORMFEED>

What formats output as a form feed. The default is C<\f>.

=item HANDLE->format_page_number(EXPR)

=item $FORMAT_PAGE_NUMBER

=item $%
X<$%> X<$FORMAT_PAGE_NUMBER>

The current page number of the currently selected output channel.

Mnemonic: C<%> is page number in B<nroff>.

=item HANDLE->format_lines_left(EXPR)

=item $FORMAT_LINES_LEFT

=item $-
X<$-> X<$FORMAT_LINES_LEFT>

The number of lines left on the page of the currently selected output channel.

Mnemonic: lines_on_page - lines_printed.

=item Handle->format_line_break_characters EXPR

=item $FORMAT_LINE_BREAK_CHARACTERS

=item $:
X<$:> X<FORMAT_LINE_BREAK_CHARACTERS>

The current set of characters after which a string may be broken to fill continuation fields (starting with C<^>) in a format. The default is S<" \n-">, to break on a space, newline, or a hyphen.

Mnemonic: a "colon" in poetry is a part of a line.

=item HANDLE->format_lines_per_page(EXPR)

=item $FORMAT_LINES_PER_PAGE

=item $=
X<$=> X<$FORMAT_LINES_PER_PAGE>

The current page length (printable lines) of the currently selected output channel. The default is 60.

Mnemonic: = has horizontal lines.
=item HANDLE->format_top_name(EXPR)

=item $FORMAT_TOP_NAME

=item $^
X<$^> X<$FORMAT_TOP_NAME>

The name of the current top-of-page format for the currently selected output channel. The default is the name of the filehandle with C<_TOP> appended. For example, the default format top name for the C<STDOUT> filehandle is C<STDOUT_TOP>.

Mnemonic: points to top of page.

=item HANDLE->format_name(EXPR)

=item $FORMAT_NAME

=item $~
X<$~> X<$FORMAT_NAME>

The name of the current report format for the currently selected output channel. The default format name is the same as the filehandle name. For example, the default format name for the C<STDOUT> filehandle is just C<STDOUT>.

Mnemonic: brother to C<$^>.

=back

=head2 Error Variables
X<error> X<exception>

The variables C<$@>, C<$!>, C<$^E>, and C<$?> contain information about different types of error conditions that may appear during execution of a Perl program. The variables are shown ordered by the "distance" between the subsystem which reported the error and the Perl process. They correspond to errors detected by the Perl interpreter, C library, operating system, or an external program, respectively.

To illustrate the differences between these variables, consider the following Perl expression, which uses a single-quoted string. After execution of this statement, perl may have set all four special error variables:

    eval q{
        open my $pipe, "/cdrom/install |" or die $!;
        my @res = <$pipe>;
        close $pipe or die "bad pipe: $?, $!";
    };

When perl executes the C<eval()> expression, it translates the C<open()>, C<< <$pipe> >>, and C<close> calls into calls to the C run-time library and thence to the operating system kernel. perl sets C<$!> to the C library's C<errno> if one of these calls fails.

C<$@> is set if the string to be C<eval>-ed did not compile (this may happen if C<open> or C<close> were imported with bad prototypes), or if Perl code executed during evaluation C<die()>d. In these cases the value of C<$@> is the compile error, or the argument to C<die> (which will interpolate C<$!> and C<$?>). (See also L<Fatal>, though.)
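The interplay of C<$@> and C<$!> described above can be sketched with a smaller, self-contained case. This is only an illustration: it uses a block C<eval> instead of the string C<eval> in the text, and the path and the variable names (C<$path>, C<$errno>, C<$errstr>, C<$eval_error>) are hypothetical, with the path assumed not to exist.

```perl
use strict;
use warnings;

my $path = "/no/such/dir/sketch.txt";   # hypothetical path, assumed missing

my ($errno, $errstr);
my $ok = eval {
    open my $fh, "<", $path or do {
        # Capture both views of $! right away, before anything resets it.
        ($errno, $errstr) = ($! + 0, "$!");
        die "cannot open $path: $errstr\n";
    };
    1;
};

# $@ holds the argument that was passed to die, with $! already interpolated.
my $eval_error = $@;

print "eval failed: $eval_error" unless $ok;
```

Copying C<$!> and C<$@> into ordinary variables as soon as they are set is a defensive habit, since later system calls or nested evals may overwrite them.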
Under a few operating systems, C<$^E> may contain a more verbose error indicator, such as in this case, "CDROM tray not closed." Systems that do not support extended error messages leave C<$^E> the same as C<$!>.

Finally, C<$?> may be set to a non-0 value if the external program F</cdrom/install> fails. The upper eight bits reflect specific error conditions encountered by the program (the program's C<exit()> value). The lower eight bits reflect mode of failure, like signal death and core dump information. See L<wait(2)> for details. In contrast to C<$!> and C<$^E>, which are set only if an error condition is detected, the variable C<$?> is set on each C<wait> or pipe C<close>, overwriting the old value. This is more like C<$@>, which on every C<eval()> is always set on failure and cleared on success.

For more details, see the individual descriptions at C<$@>, C<$!>, C<$^E>, and C<$?>.

=over 8

=item ${^CHILD_ERROR_NATIVE}
X<$^CHILD_ERROR_NATIVE>

The native status returned by the last pipe close, backtick (C<``>) command, successful call to C<wait()> or C<waitpid()>, or from the C<system()> operator. On POSIX-like systems this value can be decoded with the WIFEXITED, WEXITSTATUS, WIFSIGNALED, WTERMSIG, WIFSTOPPED, WSTOPSIG and WIFCONTINUED functions provided by the L<POSIX> module.

Under VMS this reflects the actual VMS exit status; i.e. it is the same as C<$?> when the pragma C<use vmsish 'status'> is in effect.

This variable was added in Perl 5.8.9.

=item $EXTENDED_OS_ERROR

=item $^E
X<$^E> X<$EXTENDED_OS_ERROR>

Error information specific to the current operating system. At the moment, this differs from C<$!> under only VMS, OS/2, and Win32 (and for MacPerl). On all other platforms, C<$^E> is always just the same as C<$!>.

Under VMS, C<$^E> provides the VMS status value from the last system error. This is more specific information about the last system error than that provided by C<$!>. This is particularly important when C<$!> is set to B<EVMSERR>.

Under OS/2, C<$^E> is set to the error code of the last call to OS/2 API either via CRT, or directly from perl.
Under Win32, C<$^E> always returns the last error information reported by the Win32 call C<GetLastError()> which describes the last error from within the Win32 API. Most Win32-specific code will report errors via C<$^E>. ANSI C and Unix-like calls set C<errno> and so most portable Perl code will report errors via C<$!>.

Caveats mentioned in the description of C<$!> generally apply to C<$^E>, also.

This variable was added in Perl 5.003.

Mnemonic: Extra error explanation.

=item $EXCEPTIONS_BEING_CAUGHT

=item $^S
X<$^S> X<$EXCEPTIONS_BEING_CAUGHT>

Current state of the interpreter.

    $^S         State
    ---------   -------------------
    undef       Parsing module/eval
    true (1)    Executing an eval
    false (0)   Otherwise

The first state may happen in C<$SIG{__DIE__}> and C<$SIG{__WARN__}> handlers.

This variable was added in Perl 5.004.

=item $WARNING

=item $^W
X<$^W> X<$WARNING>

The current value of the warning switch, initially true if B<-w> was used, false otherwise, but directly modifiable.

See also L<warnings>.

Mnemonic: related to the B<-w> switch.

=item ${^WARNING_BITS}
X<${^WARNING_BITS}>

The current set of warning checks enabled by the C<use warnings> pragma. It has the same scoping as the C<$^H> and C<%^H> variables. The exact values are considered internal to the L<warnings> pragma and may change between versions of Perl.

This variable was added in Perl 5.6.

=item $OS_ERROR

=item $ERRNO

=item $!
X<$!> X<$ERRNO> X<$OS_ERROR>

When referenced, C<$!> retrieves the current value of the C C<errno> integer variable. If C<$!> is assigned a numerical value, that value is stored in C<errno>. When referenced as a string, C<$!> yields the system error string corresponding to C<errno>.

Many system or library calls set C<errno> if they fail, to indicate the cause of failure. They usually do B<not> set C<errno> to zero if they succeed. This means C<errno>, hence C<$!>, is meaningful only I<immediately> after a B<failure>:

    if (open my $fh, "<", $filename) {
        # Here $! is meaningless.
        ...
    }
    else {
        # ONLY here is $! meaningful.
        ...
        # Already here $! might be meaningless.
    }
    # Since here we might have either success or failure,
    # $! is meaningless.

Here, I<meaningless> means that C<$!> may be unrelated to the outcome of the C<open()> operator. Assignment to C<$!> is similarly ephemeral. It can be used immediately before invoking the C<die()> operator, to set the exit value, or to inspect the system error string corresponding to error I<n>, or to restore C<$!> to a meaningful state.

Mnemonic: What just went bang?

=item %OS_ERROR

=item %ERRNO

=item %!
X<%!> X<%OS_ERROR> X<%ERRNO>

Each element of C<%!> has a true value only if C<$!> is set to that value. For example, C<$!{ENOENT}> is true if and only if the current value of C<$!> is C<ENOENT>; that is, if the most recent error was "No such file or directory" (or its moral equivalent: not all operating systems give that exact error, and certainly not all languages). To check if a particular key is meaningful on your system, use C<exists $!{the_key}>; for a list of legal keys, use C<keys %!>. See L<Errno> for more information, and also see L</$!>.

This variable was added in Perl 5.005.

=item $CHILD_ERROR

=item $?
X<$?> X<$CHILD_ERROR>

The status returned by the last pipe close, backtick (C<``>) command, successful call to C<wait()> or C<waitpid()>, or from the C<system()> operator. This is just the 16-bit status word returned by the traditional Unix C<wait()> system call (or else is made up to look like it). Thus, the exit value of the subprocess is really (C<<< $? >> 8 >>>), and C<$? & 127> gives which signal, if any, the process died from, and C<$? & 128> reports whether there was a core dump.

Additionally, if the C<h_errno> variable is supported in C, its value is returned via C<$?> if any C<gethost*()> function fails.

If you have installed a signal handler for C<SIGCHLD>, the value of C<$?> will usually be wrong outside that handler.

Inside an C<END> subroutine C<$?> contains the value that is going to be given to C<exit()>. You can modify C<$?> in an C<END> subroutine to change the exit status of your program. For example:

    END {
        $? = 1 if $? == 255;  # die would make it 255
    }

Under VMS, the pragma C<use vmsish 'status'> makes C<$?> reflect the actual VMS exit status, instead of the default emulation of POSIX status; see L<perlvms/$?> for details.

Mnemonic: similar to B<sh> and B<ksh>.

=item $EVAL_ERROR

=item $@
X<$@> X<$EVAL_ERROR>

The Perl syntax error message from the last C<eval()> operator. If C<$@> is the null string, the last C<eval()> parsed and executed correctly (although the operations you invoked may have failed in the normal fashion).

Warning messages are not collected in this variable. You can, however, set up a routine to process warnings by setting C<$SIG{__WARN__}> as described in L</%SIG>.

Mnemonic: Where was the syntax error "at"?

=back

=head2 Variables related to the interpreter state

These variables provide information about the current interpreter state.

=over 8

=item $COMPILING

=item $^C
X<$^C> X<$COMPILING>

The current value of the flag associated with the B<-c> switch. Mainly of use with B<-MO=...> to allow code to alter its behavior when being compiled, such as for example to C<AUTOLOAD> at compile time rather than normal, deferred loading. Setting C<$^C = 1> is similar to calling C<B::minus_c>.

This variable was added in Perl 5.6.

=item $DEBUGGING

=item $^D
X<$^D> X<$DEBUGGING>

The current value of the debugging flags. May be read or set. Like its command-line equivalent, you can use numeric or symbolic values, e.g. C<$^D = 10> or C<$^D = "st">.

Mnemonic: value of B<-D> switch.

=item ${^ENCODING}
X<${^ENCODING}>

The I<object reference> to the C<Encode> object that is used to convert the source code to Unicode. Thanks to this variable your Perl script does not have to be written in UTF-8. Default is I<undef>. The direct manipulation of this variable is highly discouraged.

This variable was added in Perl 5.8.2.

=item ${^GLOBAL_PHASE}
X<${^GLOBAL_PHASE}>

The current phase of the perl interpreter.

Possible values are:

=over 8

=item CONSTRUCT

The C<PerlInterpreter*> is being constructed via C<perl_construct>. This value is mostly there for completeness and for use via the underlying C variable C<PL_phase>.
It's not really possible for Perl code to be executed unless construction of the interpreter is finished.

=item START

This is the global compile-time. That includes, basically, every C<BEGIN> block executed directly or indirectly during the compile-time of the top-level program.

This phase is not called "BEGIN" to avoid confusion with C<BEGIN>-blocks, as those are executed during compile-time of any compilation unit, not just the top-level program. A new, localised compile-time entered at run-time, for example by constructs such as C<eval "use SomeModule">, is not a global interpreter phase, and therefore isn't reflected by C<${^GLOBAL_PHASE}>.

=item CHECK

Execution of any C<CHECK> blocks.

=item INIT

Similar to "CHECK", but for C<INIT>-blocks, not C<CHECK> blocks.

=item RUN

The main run-time, i.e. the execution of C<PL_main_root>.

=item END

Execution of any C<END> blocks.

=item DESTRUCT

Global destruction.

=back

Also note that there's no value for UNITCHECK-blocks. That's because those are run for each compilation unit individually, and therefore are not a global interpreter phase.

Not every program has to go through each of the possible phases, but transition from one phase to another can only happen in the order described in the above list.
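One practical use of the phase value, sketched here under assumptions rather than taken from the original text, is to let a destructor skip optional cleanup once global destruction has begun. The package C<Sketch::Resource> and its C<@log> array are hypothetical names invented for this illustration.

```perl
use strict;
use warnings;

{
    package Sketch::Resource;   # hypothetical class for this sketch

    our @log;

    sub new {
        my ($class, $name) = @_;
        return bless { name => $name }, $class;
    }

    sub DESTROY {
        my $self = shift;
        # During global destruction other objects may already be gone,
        # so skip the optional bookkeeping in that phase.
        return if ${^GLOBAL_PHASE} eq 'DESTRUCT';
        push @log, "released $self->{name} during ${^GLOBAL_PHASE}";
    }
}

{
    my $resource = Sketch::Resource->new("handle-1");
}   # $resource goes out of scope here, while the program is still running
```

Because the object is destroyed when its lexical scope ends, the destructor in this sketch sees the run-time phase and records the release.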
An example of all of the phases Perl code can see:

    BEGIN { print "compile-time: ${^GLOBAL_PHASE}\n" }

    INIT  { print "init-time: ${^GLOBAL_PHASE}\n" }

    CHECK { print "check-time: ${^GLOBAL_PHASE}\n" }

    {
        package Print::Phase;

        sub new {
            my ($class, $time) = @_;
            return bless \$time, $class;
        }

        sub DESTROY {
            my $self = shift;
            print "$$self: ${^GLOBAL_PHASE}\n";
        }
    }

    print "run-time: ${^GLOBAL_PHASE}\n";

    my $runtime = Print::Phase->new(
        "lexical variables are garbage collected before END"
    );

    END   { print "end-time: ${^GLOBAL_PHASE}\n" }

    our $destruct = Print::Phase->new(
        "package variables are garbage collected after END"
    );

This will print out

    compile-time: START
    check-time: CHECK
    init-time: INIT
    run-time: RUN
    lexical variables are garbage collected before END: RUN
    end-time: END
    package variables are garbage collected after END: DESTRUCT

This variable was added in Perl 5.14.0.

=item $^H
X<$^H>

WARNING: This variable is strictly for internal use only. Its availability, behavior, and contents are subject to change without notice.

This variable contains compile-time hints for the Perl interpreter. At the end of compilation of a BLOCK the value of this variable is restored to the value when the interpreter started to compile the BLOCK.

When perl begins to parse any block construct that provides a lexical scope (e.g., eval body, required file, subroutine body, loop body, or conditional block), the existing value of C<$^H> is saved, but its value is left unchanged. When the compilation of the block is completed, it regains the saved value. Between the points where its value is saved and restored, code that executes within BEGIN blocks is free to change the value of C<$^H>.

This behavior provides the semantic of lexical scoping, and is used in, for instance, the C<use strict> pragma.

The contents should be an integer; different bits of it are used for different pragmatic flags.
Here's an example:

    sub add_100 { $^H |= 0x100 }

    sub foo {
        BEGIN { add_100() }
        bar->baz($boon);
    }

Consider what happens during execution of the BEGIN block. At this point the BEGIN block has already been compiled, but the body of C<foo()> is still being compiled. The new value of C<$^H> will therefore be visible only while the body of C<foo()> is being compiled.

Substitution of C<BEGIN { add_100() }> block with:

    BEGIN { require strict; strict->import('vars') }

demonstrates how C<use strict 'vars'> is implemented. Here's a conditional version of the same lexical pragma:

    BEGIN { require strict; strict->import('vars') if $condition }

This variable was added in Perl 5.003.

=item %^H
X<%^H>

The C<%^H> hash provides the same scoping semantic as C<$^H>. This makes it useful for implementation of lexically scoped pragmas. See L<perlpragma>.

When putting items into C<%^H>, in order to avoid conflicting with other users of the hash there is a convention regarding which keys to use. A module should use only keys that begin with the module's name (the name of its main package) and a "/" character. For example, a module C<Foo::Bar> should use keys such as C<Foo::Bar/baz>.

This variable was added in Perl 5.6.

=item ${^OPEN}
X<${^OPEN}>

An internal variable used by PerlIO. A string in two parts, separated by a C<\0> byte: the first part describes the input layers, the second part describes the output layers.

This variable was added in Perl 5.8.0.

=item $PERLDB

=item $^P
X<$^P> X<$PERLDB>

The internal variable for debugging support. The meanings of the various bits are subject to change, but currently indicate:

=over 6

=item 0x01

Debug subroutine enter/exit.

=item 0x02

Line-by-line debugging. Causes C<DB::DB()> subroutine to be called for each statement executed. Also causes saving source code lines (like 0x400).

=item 0x04

Switch off optimizations.

=item 0x08

Preserve more data for future interactive inspections.

=item 0x10

Keep info about source lines on which a subroutine is defined.

=item 0x20

Start with single-step on.
=item 0x40

Use subroutine address instead of name when reporting.

=item 0x80

Report C<goto &subroutine> as well.

=item 0x100

Provide informative "file" names for evals based on the place they were compiled.

=item 0x200

Provide informative names to anonymous subroutines based on the place they were compiled.

=item 0x400

Save source code lines into C<@{"_<$filename"}>.

=back

Some bits may be relevant at compile-time only, some at run-time only. This is a new mechanism and the details may change. See also L<perldebguts>.

=item ${^TAINT}
X<${^TAINT}>

Reflects if taint mode is on or off. 1 for on (the program was run with B<-T>), 0 for off, -1 when only taint warnings are enabled (i.e. with B<-t> or B<-TU>).

This variable is read-only.

This variable was added in Perl 5.8.

=item ${^UNICODE}
X<${^UNICODE}>

Reflects certain Unicode settings of Perl. See L<perlrun> documentation for the C<-C> switch for more information about the possible values.

This variable is set during Perl startup and is thereafter read-only.

This variable was added in Perl 5.8.2.

=item ${^UTF8CACHE}
X<${^UTF8CACHE}>

This variable controls the state of the internal UTF-8 offset caching code. 1 for on (the default), 0 for off, -1 to debug the caching code by checking all its results against linear scans, and panicking on any discrepancy.

This variable was added in Perl 5.8.9.

=item ${^UTF8LOCALE}
X<${^UTF8LOCALE}>

This variable indicates whether a UTF-8 locale was detected by perl at startup. This information is used by perl when it's in adjust-utf8ness-to-locale mode (as when run with the C<-CL> command-line switch); see L<perlrun> for more info on this.

This variable was added in Perl 5.8.8.

=back

=head2 Deprecated and removed variables

Deprecating a variable announces the intent of the perl maintainers to eventually remove the variable from the language. It may still be available despite its status. Using a deprecated variable triggers a warning.

Once a variable is removed, its use triggers an error telling you the variable is unsupported.
See L<perldiag> for details about error messages.

=over 8

=item $OFMT

=item $#
X<$#> X<$OFMT>

C<$#> was a variable that could be used to format printed numbers. After a deprecation cycle, its magic was removed in Perl 5.10 and using it now triggers a warning: C<$# is no longer supported>.

This is not the sigil you use in front of an array name to get the last index, like C<$#array>. That's still how you get the last index of an array in Perl. The two have nothing to do with each other.

Deprecated in Perl 5.

Removed in Perl 5.10.

=item $*
X<$*>

C<$*> was a variable that you could use to enable multiline matching. After a deprecation cycle, its magic was removed in Perl 5.10. Using it now triggers a warning: C<$* is no longer supported>. You should use the C</s> and C</m> regexp modifiers instead.

Deprecated in Perl 5.

Removed in Perl 5.10.

=item $ARRAY_BASE

=item $[
X<$[> X<$ARRAY_BASE>

This variable stores the index of the first element in an array, and of the first character in a substring. The default is 0, but you could theoretically set it to 1 to make Perl behave more like B<awk> (or Fortran) when subscripting and when evaluating the index() and substr() functions.

As of release 5 of Perl, assignment to C<$[> is treated as a compiler directive, and cannot influence the behavior of any other file. (That's why you can only assign compile-time constants to it.) Its use is highly discouraged.

Prior to Perl 5.10, assignment to C<$[> could be seen from outer lexical scopes in the same file, unlike other compile-time directives (such as L<strict>). Using local() on it would bind its value strictly to a lexical block. Now it is always lexically scoped.

As of Perl 5.16, it is implemented by the L<arybase> module. See L<arybase> for more details on its behaviour.

Under C<use v5.16>, or C<no feature "array_base">, C<$[> no longer has any effect, and always contains 0. Assigning 0 to it is permitted, but any other value will produce an error.

Mnemonic: [ begins subscripts.

Deprecated in Perl 5.12.
=item $OLD_PERL_VERSION

=item $]
X<$]> X<$OLD_PERL_VERSION>

See L</$^V> for a more modern representation of the Perl version that allows accurate string comparisons.

The version + patchlevel / 1000 of the Perl interpreter. This variable can be used to determine whether the Perl interpreter executing a script is in the right range of versions:

    warn "No checksumming!\n" if $] < 3.019;

The floating point representation can sometimes lead to inaccurate numeric comparisons.

See also the documentation of C<use VERSION> and C<require VERSION> for a convenient way to fail if the running Perl interpreter is too old.

Mnemonic: Is this version of perl in the right bracket?

=back

=cut

=head1 NAME

perlclib - Internal replacements for standard C library functions

=head1 DESCRIPTION

One thing Perl porters should note is that F<perl> doesn't tend to use that much of the C standard library internally; you'll see very little use of, for example, the F<ctype.h> functions in there. This is because Perl tends to reimplement or abstract standard library functions, so that we know exactly how they're going to operate.

This is a reference card for people who are familiar with the C library and who want to do things the Perl way; to tell them which functions they ought to use instead of the more normal C functions.

=head2 Conventions

In the following tables:

=over 3

=item C<t>

is a type.

=item C<p>

is a pointer.

=item C<n>

is a number.

=item C<s>

is a string.

=back

C<sv>, C<av>, C<hv>, etc. represent variables of their respective types.

=head2 File Operations

Instead of the F<stdio.h> functions, you should use the Perl abstraction layer. Instead of C<FILE*> types, you need to be handling C<PerlIO*> types. Don't forget that with the new PerlIO layered I/O abstraction C<FILE*> types may not even be available. See also the C<perlapio> documentation for more information about the following functions:

    Instead Of:                 Use:

    stdin                       PerlIO_stdin()
    stdout                      PerlIO_stdout()
    stderr                      PerlIO_stderr()

    fopen(fn, mode)             PerlIO_open(fn, mode)
    freopen(fn, mode, stream)   PerlIO_reopen(fn, mode, perlio) (Deprecated)
    fflush(stream)              PerlIO_flush(perlio)
    fclose(stream)              PerlIO_close(perlio)

=head2 File Input and Output

    Instead Of:                 Use:

    fprintf(stream, fmt, ...)   PerlIO_printf(perlio, fmt, ...)

    [f]getc(stream)             PerlIO_getc(perlio)
    [f]putc(stream, n)          PerlIO_putc(perlio, n)
    ungetc(n, stream)           PerlIO_ungetc(perlio, n)

Note that the PerlIO equivalents of C<fread> and C<fwrite> are slightly different from their C library counterparts:

    fread(p, size, n, stream)   PerlIO_read(perlio, buf, numbytes)
    fwrite(p, size, n, stream)  PerlIO_write(perlio, buf, numbytes)

    fputs(s, stream)            PerlIO_puts(perlio, s)

There is no equivalent to C<fgets>; one should use C<sv_gets> instead:

    fgets(s, n, stream)         sv_gets(sv, perlio, append)

=head2 File Positioning

    Instead Of:                 Use:

    feof(stream)                PerlIO_eof(perlio)
    fseek(stream, n, whence)    PerlIO_seek(perlio, n, whence)
    rewind(stream)              PerlIO_rewind(perlio)

    fgetpos(stream, p)          PerlIO_getpos(perlio, sv)
    fsetpos(stream, p)          PerlIO_setpos(perlio, sv)

    ferror(stream)              PerlIO_error(perlio)
    clearerr(stream)            PerlIO_clearerr(perlio)

=head2 Memory Management and String Handling

    Instead Of:                    Use:

    t* p = malloc(n)               Newx(p, n, t)
    t* p = calloc(n, s)            Newxz(p, n, t)
    p = realloc(p, n)              Renew(p, n, t)
    memcpy(dst, src, n)            Copy(src, dst, n, t)
    memmove(dst, src, n)           Move(src, dst, n, t)
    memcpy(dst, src, sizeof(t))    StructCopy(src, dst, t)
    memset(dst, 0, n * sizeof(t))  Zero(dst, n, t)
    memzero(dst, 0)                Zero(dst, n, char)
    free(p)                        Safefree(p)

    strdup(p)                      savepv(p)
    strndup(p, n)                  savepvn(p, n) (Hey, strndup doesn't exist!)

    strstr(big, little)            instr(big, little)
    strcmp(s1, s2)                 strLE(s1, s2) / strEQ(s1, s2) / strGT(s1, s2)
    strncmp(s1, s2, n)             strnNE(s1, s2, n) / strnEQ(s1, s2, n)

Notice the different order of arguments to C<Copy> and C<Move> than used in C<memcpy> and C<memmove>.

Most of the time, though, you'll want to be dealing with SVs internally instead of raw C<char *> strings:

    strlen(s)                   sv_len(sv)
    strcpy(dt, src)             sv_setpv(sv, s)
    strncpy(dt, src, n)         sv_setpvn(sv, s, n)
    strcat(dt, src)             sv_catpv(sv, s)
    strncat(dt, src, n)         sv_catpvn(sv, s, n)
    sprintf(s, fmt, ...)        sv_setpvf(sv, fmt, ...)

Note also the existence of C<sv_catpvf> and C<sv_vcatpvfn>, combining concatenation with formatting.

Sometimes instead of zeroing the allocated heap by using Newxz() you should consider "poisoning" the data. This means writing a bit pattern into it that should be illegal as pointers (and floating point numbers), and also hopefully surprising enough as integers, so that any code attempting to use the data without forethought will break sooner rather than later. Poisoning can be done using the Poison() macros, which have similar arguments to Zero():

    PoisonWith(dst, n, t, b)    scribble memory with byte b
    PoisonNew(dst, n, t)        equal to PoisonWith(dst, n, t, 0xAB)
    PoisonFree(dst, n, t)       equal to PoisonWith(dst, n, t, 0xEF)
    Poison(dst, n, t)           equal to PoisonFree(dst, n, t)

=head2 Character Class Tests

There are two types of character class tests that Perl implements: one type deals in C<char>s and is thus B<not> Unicode aware (and hence deprecated unless you B<know> you should use them) and the other type deals in C<UV>s and knows about Unicode properties. In the following table, C<c> is a C<char>, and C<u> is a Unicode codepoint.
    Instead Of:                 Use:            But better use:

    isalnum(c)                  isALNUM(c)      isALNUM_uni(u)
    isalpha(c)                  isALPHA(c)      isALPHA_uni(u)
    iscntrl(c)                  isCNTRL(c)      isCNTRL_uni(u)
    isdigit(c)                  isDIGIT(c)      isDIGIT_uni(u)
    isgraph(c)                  isGRAPH(c)      isGRAPH_uni(u)
    islower(c)                  isLOWER(c)      isLOWER_uni(u)
    isprint(c)                  isPRINT(c)      isPRINT_uni(u)
    ispunct(c)                  isPUNCT(c)      isPUNCT_uni(u)
    isspace(c)                  isSPACE(c)      isSPACE_uni(u)
    isupper(c)                  isUPPER(c)      isUPPER_uni(u)
    isxdigit(c)                 isXDIGIT(c)     isXDIGIT_uni(u)

    tolower(c)                  toLOWER(c)      toLOWER_uni(u)
    toupper(c)                  toUPPER(c)      toUPPER_uni(u)

=head2 F<stdlib.h> functions

    Instead Of:                 Use:

    atof(s)                     Atof(s)
    atol(s)                     Atol(s)
    strtod(s, &p)               Nothing.  Just don't use it.
    strtol(s, &p, n)            Strtol(s, &p, n)
    strtoul(s, &p, n)           Strtoul(s, &p, n)

Notice also the C<grok_bin>, C<grok_hex>, and C<grok_oct> functions in F<numeric.c> for converting strings representing numbers in the respective bases into C<NV>s.

In theory C<Strtol> and C<Strtoul> may not be defined if the machine perl is built on doesn't actually have strtol and strtoul. But as those 2 functions are part of the 1989 ANSI C spec we suspect you'll find them everywhere by now.

    int rand()                  double Drand01()
    srand(n)                    { seedDrand01((Rand_seed_t)n);
                                  PL_srand_called = TRUE; }

    exit(n)                     my_exit(n)
    system(s)                   Don't. Look at pp_system or use my_popen

    getenv(s)                   PerlEnv_getenv(s)
    setenv(s, val)              my_putenv(s, val)

=head2 Miscellaneous functions

You should not even B<want> to use F<setjmp.h> functions, but if you think you do, use the C<JMPENV> stack in F<scope.h> instead.

For C<signal>/C<sigaction>, use C<rsignal(signo, handler)>.

=head1 SEE ALSO

L<perlapi>, L<perlapio>, L<perlguts>

=encoding utf8

=head1 NAME

perl5143delta - what is new for perl v5.14.3

=head1 DESCRIPTION

This document describes differences between the 5.14.2 release and the 5.14.3 release.

If you are upgrading from an earlier release such as 5.12.0, first read L<perl5140delta>, which describes differences between 5.12.0 and 5.14.0.

=head1 Core Enhancements

No changes since 5.14.0.
=head1 Security

=head2 C<Digest> unsafe use of eval (CVE-2011-3597)

The C<< Digest->new() >> function did not properly sanitize input before using it in an eval() call, which could lead to the injection of arbitrary Perl code.

In order to exploit this flaw, the attacker would need to be able to set the algorithm name used, or be able to execute arbitrary Perl code already.

This problem has been fixed.

=head2 Heap buffer overrun in 'x' string repeat operator (CVE-2012-5195)

Poorly written perl code that allows an attacker to specify the count to perl's 'x' string repeat operator can already cause a memory exhaustion denial-of-service attack. A flaw in versions of perl before 5.15.5 can escalate that into a heap buffer overrun; coupled with versions of glibc before 2.16, it possibly allows the execution of arbitrary code.

This problem has been fixed.

=head1 Incompatible Changes

There are no changes intentionally incompatible with 5.14.0. If any exist, they are bugs and reports are welcome.

=head1 Deprecations

There have been no deprecations since 5.14.0.

=head1 Modules and Pragmata

=head2 New Modules and Pragmata

None

=head2 Updated Modules and Pragmata

=over 4

=item *

L<PerlIO::scalar> was updated to fix a bug in which opening a filehandle to a glob copy caused assertion failures (under debugging) or hangs or other erratic behaviour without debugging.

=item *

L<ODBM_File> and L<NDBM_File> were updated to allow building on GNU/Hurd.

=item *

L<IPC::Open3> has been updated to fix a regression introduced in perl 5.12, which broke C<IPC::Open3::open3($in, $out, $err, '-')>. [perl #95748]

=item *

L<Digest> has been upgraded from version 1.16 to 1.16_01.

See L</Security>.

=item *

L<Module::CoreList> has been updated to version 2.49_04 to add data for this release.

=back

=head2 Removed Modules and Pragmata

None

=head1 Documentation

=head2 New Documentation

None

=head2 Changes to Existing Documentation

=head3 L<perlcheat>

=over 4

=item *

L<perlcheat> was updated to 5.14.
=back =head1 Configuration and Compilation =over 4 =item * h2ph was updated to search correctly gcc include directories on platforms such as Debian with multi-architecture support. =item * In Configure, the test for procselfexe was refactored into a loop. =back =head1 Platform Support =head2 New Platforms None =head2 Discontinued Platforms None =head2 Platform-Specific Notes =over 4 =item FreeBSD The FreeBSD hints file was corrected to be compatible with FreeBSD 10.0. =item Solaris and NetBSD Configure was updated for "procselfexe" support on Solaris and NetBSD. =item HP-UX README.hpux was updated to note the existence of a broken header in HP-UX 11.00. =item Linux libutil is no longer used when compiling on Linux platforms, which avoids warnings being emitted. The system gcc (rather than any other gcc which might be in the compiling user's path) is now used when searching for libraries such as C<-lm>. =item Mac OS X The locale tests were updated to reflect the behaviour of locales in Mountain Lion. =item GNU/Hurd Various build and test fixes were included for GNU/Hurd. LFS support was enabled in GNU/Hurd. =item NetBSD The NetBSD hints file was corrected to be compatible with NetBSD 6.* =back =head1 Bug Fixes =over 4 =item * A regression has been fixed that was introduced in 5.14, in C regular expression matching, in which a match improperly fails if the pattern is in UTF-8, the target string is not, and a Latin-1 character precedes a character in the string that should match the pattern. [perl #101710] =item * In case-insensitive regular expression pattern matching, no longer on UTF-8 encoded strings does the scan for the start of match only look at the first possible position. This caused matches such as C<"f\x{FB00}" =~ /ff/i> to fail. =item * The sitecustomize support was made relocatableinc aware, so that -Dusesitecustomize and -Duserelocatableinc may be used together. 
=item * The smartmatch operator (C<~~>) was changed so that the right-hand side takes precedence during C operations. =item * A bug has been fixed in the tainting support, in which an C operation on a tainted constant would cause all other constants to become tainted. [perl #64804] =item * A regression has been fixed that was introduced in perl 5.12, whereby tainting errors were not correctly propagated through C. [perl #111654] =item * A regression has been fixed that was introduced in perl 5.14, in which C and C no longer matched the opposite case. [perl #101970] =back =head1 Acknowledgements Perl 5.14.3 represents approximately 12 months of development since Perl 5.14.2 and contains approximately 2,300 lines of changes across 64 files from 22 authors. Perl continues to flourish into its third decade thanks to a vibrant community of users and developers. The following people are known to have contributed the improvements that became Perl 5.14.3: Abigail, Andy Dougherty, Carl Hayter, Chris 'BinGOs' Williams, Dave Rolsky, David Mitchell, Dominic Hargreaves, Father Chrysostomos, Florian Ragwitz, H.Merijn Brand, Jilles Tjoelker, Karl Williamson, Leon Timmermans, Michael G Schwern, Nicholas Clark, Niko Tyni, Pino Toscano, Ricardo Signes, Salvador Fandiño, Samuel Thibault, Steve Hay, Tony Cook. The list above is almost certainly incomplete as it is automatically generated from version control history. In particular, it does not include the names of the (very much appreciated) contributors who reported issues to the Perl bug tracker. Many of the changes included in this version originated in the CPAN modules included in Perl's core. We're grateful to the entire CPAN community for helping Perl to flourish. For a more complete list of all of Perl's historical contributors, please see the F file in the Perl source distribution. 
=head1 Reporting Bugs

If you find what you think is a bug, you might check the articles recently posted to the comp.lang.perl.misc newsgroup and the perl bug database at http://rt.perl.org/perlbug/ . There may also be information at http://www.perl.org/ , the Perl Home Page.

If you believe you have an unreported bug, please run the L<perlbug> program included with your release. Be sure to trim your bug down to a tiny but sufficient test case. Your bug report, along with the output of C<perl -V>, will be sent off to perlbug@perl.org to be analysed by the Perl porting team.

If the bug you are reporting has security implications, which make it inappropriate to send to a publicly archived mailing list, then please send it to perl5-security-report@perl.org. This points to a closed subscription unarchived mailing list, which includes all the core committers, who will be able to help assess the impact of issues, figure out a resolution, and help co-ordinate the release of patches to mitigate or fix the problem across all platforms on which Perl is supported. Please only use this address for security issues in the Perl core, not for modules independently distributed on CPAN.

=head1 SEE ALSO

The F<Changes> file for an explanation of how to view exhaustive details on what changed.

The F<INSTALL> file for how to build Perl.

The F<README> file for general stuff.

The F<Artistic> and F<Copying> files for copyright information.

=cut

=head1 NAME

perlfaq - frequently asked questions about Perl

=head1 DESCRIPTION

The perlfaq comprises several documents that answer the most commonly asked questions about Perl and Perl programming. It's divided by topic into nine major sections outlined in this document.

=head2 Where to find the perlfaq

The perlfaq is an evolving document. Read the latest version at L. It is also included in the standard Perl distribution.

=head2 How to use the perlfaq

The C<perldoc> command line tool is part of the standard Perl distribution.
To read the perlfaq: $ perldoc perlfaq To search the perlfaq question headings: $ perldoc -q open =head2 How to contribute to the perlfaq Review L. If you don't find your suggestion create an issue or pull request against L. Once approved, changes are merged into L, the repository which drives L, and they are distributed with the next Perl 5 release. =head2 What if my question isn't answered in the FAQ? Try the resources in L. =head1 TABLE OF CONTENTS =over 4 =item perlfaq1 - General Questions About Perl =item perlfaq2 - Obtaining and Learning about Perl =item perlfaq3 - Programming Tools =item perlfaq4 - Data Manipulation =item perlfaq5 - Files and Formats =item perlfaq6 - Regular Expressions =item perlfaq7 - General Perl Language Issues =item perlfaq8 - System Interaction =item perlfaq9 - Web, Email and Networking =back =head1 THE QUESTIONS =head2 L: General Questions About Perl This section of the FAQ answers very general, high-level questions about Perl. =over 4 =item * What is Perl? =item * Who supports Perl? Who develops it? Why is it free? =item * Which version of Perl should I use? =item * What are Perl 4, Perl 5, or Perl 6? =item * What is Perl 6? =item * How stable is Perl? =item * Is Perl difficult to learn? =item * How does Perl compare with other languages like Java, Python, REXX, Scheme, or Tcl? =item * Can I do [task] in Perl? =item * When shouldn't I program in Perl? =item * What's the difference between "perl" and "Perl"? =item * What is a JAPH? =item * How can I convince others to use Perl? =back =head2 L: Obtaining and Learning about Perl This section of the FAQ answers questions about where to find source and documentation for Perl, support, and related matters. =over 4 =item * What machines support Perl? Where do I get it? =item * How can I get a binary version of Perl? =item * I don't have a C compiler. How can I build my own Perl interpreter? =item * I copied the Perl binary from one machine to another, but scripts don't work. 
=item * I grabbed the sources and tried to compile but gdbm/dynamic loading/malloc/linking/... failed. How do I make it work? =item * What modules and extensions are available for Perl? What is CPAN? =item * Where can I get information on Perl? =item * What is perl.com? Perl Mongers? pm.org? perl.org? cpan.org? =item * Where can I post questions? =item * Perl Books =item * Which magazines have Perl content? =item * Which Perl blogs should I read? =item * What mailing lists are there for Perl? =item * Where can I buy a commercial version of Perl? =item * Where do I send bug reports? =back =head2 L: Programming Tools This section of the FAQ answers questions related to programmer tools and programming support. =over 4 =item * How do I do (anything)? =item * How can I use Perl interactively? =item * How do I find which modules are installed on my system? =item * How do I debug my Perl programs? =item * How do I profile my Perl programs? =item * How do I cross-reference my Perl programs? =item * Is there a pretty-printer (formatter) for Perl? =item * Is there an IDE or Windows Perl Editor? =item * Where can I get Perl macros for vi? =item * Where can I get perl-mode or cperl-mode for emacs? =item * How can I use curses with Perl? =item * How can I write a GUI (X, Tk, Gtk, etc.) in Perl? =item * How can I make my Perl program run faster? =item * How can I make my Perl program take less memory? =item * Is it safe to return a reference to local or lexical data? =item * How can I free an array or hash so my program shrinks? =item * How can I make my CGI script more efficient? =item * How can I hide the source for my Perl program? =item * How can I compile my Perl program into byte code or C? =item * How can I get C<#!perl> to work on [MS-DOS,NT,...]? =item * Can I write useful Perl programs on the command line? =item * Why don't Perl one-liners work on my DOS/Mac/VMS system? =item * Where can I learn about CGI or Web programming in Perl? 
=item * Where can I learn about object-oriented Perl programming? =item * Where can I learn about linking C with Perl? =item * I've read perlembed, perlguts, etc., but I can't embed perl in my C program; what am I doing wrong? =item * When I tried to run my script, I got this message. What does it mean? =item * What's MakeMaker? =back =head2 L: Data Manipulation This section of the FAQ answers questions related to manipulating numbers, dates, strings, arrays, hashes, and miscellaneous data issues. =over 4 =item * Why am I getting long decimals (eg, 19.9499999999999) instead of the numbers I should be getting (eg, 19.95)? =item * Why is int() broken? =item * Why isn't my octal data interpreted correctly? =item * Does Perl have a round() function? What about ceil() and floor()? Trig functions? =item * How do I convert between numeric representations/bases/radixes? =item * Why doesn't & work the way I want it to? =item * How do I multiply matrices? =item * How do I perform an operation on a series of integers? =item * How can I output Roman numerals? =item * Why aren't my random numbers random? =item * How do I get a random number between X and Y? =item * How do I find the day or week of the year? =item * How do I find the current century or millennium? =item * How can I compare two dates and find the difference? =item * How can I take a string and turn it into epoch seconds? =item * How can I find the Julian Day? =item * How do I find yesterday's date? =item * Does Perl have a Year 2000 or 2038 problem? Is Perl Y2K compliant? =item * How do I validate input? =item * How do I unescape a string? =item * How do I remove consecutive pairs of characters? =item * How do I expand function calls in a string? =item * How do I find matching/nesting anything? =item * How do I reverse a string? =item * How do I expand tabs in a string? =item * How do I reformat a paragraph? =item * How can I access or change N characters of a string? 
=item * How do I change the Nth occurrence of something? =item * How can I count the number of occurrences of a substring within a string? =item * How do I capitalize all the words on one line? =item * How can I split a [character]-delimited string except when inside [character]? =item * How do I strip blank space from the beginning/end of a string? =item * How do I pad a string with blanks or pad a number with zeroes? =item * How do I extract selected columns from a string? =item * How do I find the soundex value of a string? =item * How can I expand variables in text strings? =item * What's wrong with always quoting "$vars"? =item * Why don't my <<HERE documents work? =item * What is the difference between a list and an array? =item * What is the difference between $array[1] and @array[1]? =item * How can I remove duplicate elements from a list or array? =item * How can I tell whether a certain element is contained in a list or array? =item * How do I compute the difference of two arrays? How do I compute the intersection of two arrays? =item * How do I test whether two arrays or hashes are equal? =item * How do I find the first array element for which a condition is true? =item * How do I handle linked lists? =item * How do I handle circular lists? =item * How do I shuffle an array randomly? =item * How do I process/modify each element of an array? =item * How do I select a random element from an array? =item * How do I permute N elements of a list? =item * How do I sort an array by (anything)? =item * How do I manipulate arrays of bits? =item * Why does defined() return true on empty arrays and hashes? =item * How do I process an entire hash? =item * How do I merge two hashes? =item * What happens if I add or remove keys from a hash while iterating over it? =item * How do I look up a hash element by value? =item * How can I know how many entries are in a hash? =item * How do I sort a hash (optionally by value instead of key)?
=item * How can I always keep my hash sorted? =item * What's the difference between "delete" and "undef" with hashes? =item * Why don't my tied hashes make the defined/exists distinction? =item * How do I reset an each() operation part-way through? =item * How can I get the unique keys from two hashes? =item * How can I store a multidimensional array in a DBM file? =item * How can I make my hash remember the order I put elements into it? =item * Why does passing a subroutine an undefined element in a hash create it? =item * How can I make the Perl equivalent of a C structure/C++ class/hash or array of hashes or arrays? =item * How can I use a reference as a hash key? =item * How can I check if a key exists in a multilevel hash? =item * How can I prevent addition of unwanted keys into a hash? =item * How do I handle binary data correctly? =item * How do I determine whether a scalar is a number/whole/integer/float? =item * How do I keep persistent data across program calls? =item * How do I print out or copy a recursive data structure? =item * How do I define methods for every class/object? =item * How do I verify a credit card checksum? =item * How do I pack arrays of doubles or floats for XS code? =back =head2 L: Files and Formats This section deals with I/O and the "f" issues: filehandles, flushing, formats, and footers. =over 4 =item * How do I flush/unbuffer an output filehandle? Why must I do this? =item * How do I change, delete, or insert a line in a file, or append to the beginning of a file? =item * How do I count the number of lines in a file? =item * How do I delete the last N lines from a file? =item * How can I use Perl's C<-i> option from within a program? =item * How can I copy a file? =item * How do I make a temporary file name? =item * How can I manipulate fixed-record-length files? =item * How can I make a filehandle local to a subroutine? How do I pass filehandles between subroutines? How do I make an array of filehandles? 
=item * How can I use a filehandle indirectly? =item * How can I set up a footer format to be used with write()? =item * How can I write() into a string? =item * How can I open a filehandle to a string? =item * How can I output my numbers with commas added? =item * How can I translate tildes (~) in a filename? =item * How come when I open a file read-write it wipes it out? =item * Why do I sometimes get an "Argument list too long" when I use <*>? =item * How can I open a file with a leading ">" or trailing blanks? =item * How can I reliably rename a file? =item * How can I lock a file? =item * Why can't I just open(FH, ">file.lock")? =item * I still don't get locking. I just want to increment the number in the file. How can I do this? =item * All I want to do is append a small amount of text to the end of a file. Do I still have to use locking? =item * How do I randomly update a binary file? =item * How do I get a file's timestamp in perl? =item * How do I set a file's timestamp in perl? =item * How do I print to more than one file at once? =item * How can I read in an entire file all at once? =item * How can I read in a file by paragraphs? =item * How can I read a single character from a file? From the keyboard? =item * How can I tell whether there's a character waiting on a filehandle? =item * How do I do a C<tail -f> in perl? =item * How do I dup() a filehandle in Perl? =item * How do I close a file descriptor by number? =item * Why can't I use "C:\temp\foo" in DOS paths? Why doesn't `C:\temp\foo.exe` work? =item * Why doesn't glob("*.*") get all the files? =item * Why does Perl let me delete read-only files? Why does C<-i> clobber protected files? Isn't this a bug in Perl? =item * How do I select a random line from a file? =item * Why do I get weird spaces when I print an array of lines? =item * How do I traverse a directory tree? =item * How do I delete a directory tree? =item * How do I copy an entire directory?
=back =head2 L<perlfaq6>: Regular Expressions This section is surprisingly small because the rest of the FAQ is littered with answers involving regular expressions. For example, decoding a URL and checking whether something is a number can be handled with regular expressions, but those answers are found elsewhere in this document (in perlfaq9: "How do I decode or create those %-encodings on the web" and perlfaq4: "How do I determine whether a scalar is a number/whole/integer/float", to be precise). =over 4 =item * How can I hope to use regular expressions without creating illegible and unmaintainable code? =item * I'm having trouble matching over more than one line. What's wrong? =item * How can I pull out lines between two patterns that are themselves on different lines? =item * How do I match XML, HTML, or other nasty, ugly things with a regex? =item * I put a regular expression into $/ but it didn't work. What's wrong? =item * How do I substitute case-insensitively on the LHS while preserving case on the RHS? =item * How can I make C<\w> match national character sets? =item * How can I match a locale-smart version of C<< /[a-zA-Z]/ >>? =item * How can I quote a variable to use in a regex? =item * What is C</o> really for? =item * How do I use a regular expression to strip C-style comments from a file? =item * Can I use Perl regular expressions to match balanced text? =item * What does it mean that regexes are greedy? How can I get around it? =item * How do I process each word on each line? =item * How can I print out a word-frequency or line-frequency summary? =item * How can I do approximate matching? =item * How do I efficiently match many regular expressions at once? =item * Why don't word-boundary searches with C<\b> work for me? =item * Why does using $&, $`, or $' slow my program down? =item * What good is C<\G> in a regular expression? =item * Are Perl regexes DFAs or NFAs? Are they POSIX compliant? =item * What's wrong with using grep in a void context?
=item * How can I match strings with multibyte characters? =item * How do I match a regular expression that's in a variable? =back =head2 L<perlfaq7>: General Perl Language Issues This section deals with general Perl language issues that don't clearly fit into any of the other sections. =over 4 =item * Can I get a BNF/yacc/RE for the Perl language? =item * What are all these $@%&* punctuation signs, and how do I know when to use them? =item * Do I always/never have to quote my strings or use semicolons and commas? =item * How do I skip some return values? =item * How do I temporarily block warnings? =item * What's an extension? =item * Why do Perl operators have different precedence than C operators? =item * How do I declare/create a structure? =item * How do I create a module? =item * How do I adopt or take over a module already on CPAN? =item * How do I create a class? =item * How can I tell if a variable is tainted? =item * What's a closure? =item * What is variable suicide and how can I prevent it? =item * How can I pass/return a {Function, FileHandle, Array, Hash, Method, Regex}? =item * How do I create a static variable? =item * What's the difference between dynamic and lexical (static) scoping? Between local() and my()? =item * How can I access a dynamic variable while a similarly named lexical is in scope? =item * What's the difference between deep and shallow binding? =item * Why doesn't "my($foo) = <$fh>;" work right? =item * How do I redefine a builtin function, operator, or method? =item * What's the difference between calling a function as &foo and foo()? =item * How do I create a switch or case statement? =item * How can I catch accesses to undefined variables, functions, or methods? =item * Why can't a method included in this same file be found? =item * How can I find out my current or calling package? =item * How can I comment out a large block of Perl code? =item * How do I clear a package? =item * How can I use a variable as a variable name?
=item * What does "bad interpreter" mean? =back =head2 L: System Interaction This section of the Perl FAQ covers questions involving operating system interaction. Topics include interprocess communication (IPC), control over the user-interface (keyboard, screen and pointing devices), and most anything else not related to data manipulation. =over 4 =item * How do I find out which operating system I'm running under? =item * How come exec() doesn't return? =item * How do I do fancy stuff with the keyboard/screen/mouse? =item * How do I print something out in color? =item * How do I read just one key without waiting for a return key? =item * How do I check whether input is ready on the keyboard? =item * How do I clear the screen? =item * How do I get the screen size? =item * How do I ask the user for a password? =item * How do I read and write the serial port? =item * How do I decode encrypted password files? =item * How do I start a process in the background? =item * How do I trap control characters/signals? =item * How do I modify the shadow password file on a Unix system? =item * How do I set the time and date? =item * How can I sleep() or alarm() for under a second? =item * How can I measure time under a second? =item * How can I do an atexit() or setjmp()/longjmp()? (Exception handling) =item * Why doesn't my sockets program work under System V (Solaris)? What does the error message "Protocol not supported" mean? =item * How can I call my system's unique C functions from Perl? =item * Where do I get the include files to do ioctl() or syscall()? =item * Why do setuid perl scripts complain about kernel problems? =item * How can I open a pipe both to and from a command? =item * Why can't I get the output of a command with system()? =item * How can I capture STDERR from an external command? =item * Why doesn't open() return an error when a pipe open fails? =item * What's wrong with using backticks in a void context? 
=item * How can I call backticks without shell processing? =item * Why can't my script read from STDIN after I gave it EOF (^D on Unix, ^Z on MS-DOS)? =item * How can I convert my shell script to perl? =item * Can I use perl to run a telnet or ftp session? =item * How can I write expect in Perl? =item * Is there a way to hide perl's command line from programs such as "ps"? =item * I {changed directory, modified my environment} in a perl script. How come the change disappeared when I exited the script? How do I get my changes to be visible? =item * How do I close a process's filehandle without waiting for it to complete? =item * How do I fork a daemon process? =item * How do I find out if I'm running interactively or not? =item * How do I timeout a slow event? =item * How do I set CPU limits? =item * How do I avoid zombies on a Unix system? =item * How do I use an SQL database? =item * How do I make a system() exit on control-C? =item * How do I open a file without blocking? =item * How do I tell the difference between errors from the shell and perl? =item * How do I install a module from CPAN? =item * What's the difference between require and use? =item * How do I keep my own module/library directory? =item * How do I add the directory my program lives in to the module/library search path? =item * How do I add a directory to my include path (@INC) at runtime? =item * What is socket.ph and where do I get it? =back =head2 L: Web, Email and Networking This section deals with questions related to running web sites, sending and receiving email as well as general networking. =over 4 =item * Should I use a web framework? =item * Which web framework should I use? =item * What is Plack and PSGI? =item * How do I remove HTML from a string? =item * How do I extract URLs? =item * How do I fetch an HTML file? =item * How do I automate an HTML form submission? =item * How do I decode or create those %-encodings on the web? =item * How do I redirect to another page? 
=item * How do I put a password on my web pages? =item * How do I make sure users can't enter values into a form that causes my CGI script to do bad things? =item * How do I parse a mail header? =item * How do I check a valid mail address? =item * How do I decode a MIME/BASE64 string? =item * How do I find the user's mail address? =item * How do I send email? =item * How do I use MIME to make an attachment to a mail message? =item * How do I read email? =item * How do I find out my hostname, domainname, or IP address? =item * How do I fetch/put an (S)FTP file? =item * How can I do RPC in Perl?

=back

=head1 CREDITS

Tom Christiansen wrote the original perlfaq then expanded it with the help of Nat Torkington. brian d foy substantially edited and expanded the perlfaq. perlfaq-workers and others have also supplied feedback, patches and corrections over the years.

=head1 AUTHOR AND COPYRIGHT

Tom Christiansen wrote the original version of this document. brian d foy wrote this version. See the individual perlfaq documents for additional copyright information.

This document is available under the same terms as Perl itself. Code examples in all the perlfaq documents are in the public domain. Use them as you see fit (and at your own risk with no warranty from anyone).

=head1 NAME

perlvms - VMS-specific documentation for Perl

=head1 DESCRIPTION

Gathered below are notes describing details of Perl 5's behavior on VMS. They are a supplement to the regular Perl 5 documentation, so we have focussed on the ways in which Perl 5 functions differently under VMS than it does under Unix, and on the interactions between Perl and the rest of the operating system. We haven't tried to duplicate complete descriptions of Perl features from the main Perl documentation, which can be found in the F<[.pod]> subdirectory of the Perl distribution.
We hope these notes will save you from confusion and lost sleep when writing Perl scripts on VMS. If you find we've missed something you think should appear here, please don't hesitate to drop a line to vmsperl@perl.org.

=head1 Installation

Directions for building and installing Perl 5 can be found in the file F<README.vms> in the main source directory of the Perl distribution.

=head1 Organization of Perl Images

=head2 Core Images

During the installation process, three Perl images are produced. F<Miniperl.Exe> is an executable image which contains all of the basic functionality of Perl, but cannot take advantage of Perl extensions. It is used to generate several files needed to build the complete Perl and various extensions. Once you've finished installing Perl, you can delete this image.

Most of the complete Perl resides in the shareable image F<PerlShr.Exe>, which provides a core to which the Perl executable image and all Perl extensions are linked. You should place this image in F<Sys$Share>, or define the logical name F<PerlShr> to translate to the full file specification of this image. It should be world readable. (Remember that if a user has execute only access to F<PerlShr>, VMS will treat it as if it were a privileged shareable image, and will therefore require all downstream shareable images to be INSTALLed, etc.)

Finally, F<Perl.Exe> is an executable image containing the main entry point for Perl, as well as some initialization code. It should be placed in a public directory, and made world executable. In order to run Perl with command line arguments, you should define a foreign command to invoke this image.

=head2 Perl Extensions

Perl extensions are packages which provide both XS and Perl code to add new functionality to perl. (XS is a meta-language which simplifies writing C code which interacts with Perl, see L<perlxs> for more details.)
The Perl code for an extension is treated like any other library module - it's made available in your script through the appropriate C or C statement, and usually defines a Perl package containing the extension. The portion of the extension provided by the XS code may be connected to the rest of Perl in either of two ways. In the B configuration, the object code for the extension is linked directly into F, and is initialized whenever Perl is invoked. In the B configuration, the extension's machine code is placed into a separate shareable image, which is mapped by Perl's DynaLoader when the extension is Cd or Cd in your script. This allows you to maintain the extension as a separate entity, at the cost of keeping track of the additional shareable image. Most extensions can be set up as either static or dynamic. The source code for an extension usually resides in its own directory. At least three files are generally provided: IF<.xs> (where I is the portion of the extension's name following the last C<::>), containing the XS code, IF<.pm>, the Perl library module for the extension, and F, a Perl script which uses the C library modules supplied with Perl to generate a F file for the extension. =head2 Installing static extensions Since static extensions are incorporated directly into F, you'll have to rebuild Perl to incorporate a new extension. You should edit the main F or F you use to build Perl, adding the extension's name to the C macro, and the extension's object file to the C macro. You'll also need to build the extension's object file, either by adding dependencies to the main F, or using a separate F for the extension. Then, rebuild F to incorporate the new code. Finally, you'll need to copy the extension's Perl library module to the F<[.>IF<]> subdirectory under one of the directories in C<@INC>, where I is the name of the extension, with all C<::> replaced by C<.> (e.g. 
the library module for extension Foo::Bar would be copied to a F<[.Foo.Bar]> subdirectory). =head2 Installing dynamic extensions In general, the distributed kit for a Perl extension includes a file named Makefile.PL, which is a Perl program which is used to create a F file which can be used to build and install the files required by the extension. The kit should be unpacked into a directory tree B under the main Perl source directory, and the procedure for building the extension is simply $ perl Makefile.PL ! Create Descrip.MMS $ mmk ! Build necessary files $ mmk test ! Run test code, if supplied $ mmk install ! Install into public Perl tree I The procedure by which extensions are built and tested creates several levels (at least 4) under the directory in which the extension's source files live. For this reason if you are running a version of VMS prior to V7.1 you shouldn't nest the source directory too deeply in your directory structure lest you exceed RMS' maximum of 8 levels of subdirectory in a filespec. (You can use rooted logical names to get another 8 levels of nesting, if you can't place the files near the top of the physical directory structure.) VMS support for this process in the current release of Perl is sufficient to handle most extensions. However, it does not yet recognize extra libraries required to build shareable images which are part of an extension, so these must be added to the linker options file for the extension by hand. For instance, if the F extension to Perl requires the F shareable image in order to properly link the Perl extension, then the line C must be added to the linker options file F produced during the build process for the Perl extension. By default, the shareable image for an extension is placed in the F<[.lib.site_perl.auto>I.IF<]> directory of the installed Perl directory tree (where I is F or F, and I is the name of the extension, with each C<::> translated to C<.>). 
(See the MakeMaker documentation for more details on installation options for extensions.) However, it can be manually placed in any of several locations:

=over 4

=item *

the F<[.Lib.Auto.>II<$PVers>IF<]> subdirectory of one of the directories in C<@INC> (where I is the version of Perl you're using, as supplied in C<$]>, with '.' converted to '_'), or

=item *

one of the directories in C<@INC>, or

=item *

a directory which the extension's Perl library module passes to the DynaLoader when asking it to map the shareable image, or

=item *

F or F.

=back

If the shareable image isn't in any of these places, you'll need to define a logical name I, where I is the portion of the extension's name after the last C<::>, which translates to the full file specification of the shareable image.

=head1 File specifications

=head2 Syntax

We have tried to make Perl aware of both VMS-style and Unix-style file specifications wherever possible. You may use either style, or both, on the command line and in scripts, but you may not combine the two styles within a single file specification. VMS Perl interprets Unix pathnames in much the same way as the CRTL (I the first component of an absolute path is read as the device name for the VMS file specification). There is a set of functions provided in the C package for explicit interconversion between VMS and Unix syntax; its documentation provides more details.

We've tried to minimize the dependence of Perl library modules on Unix syntax, but you may find that some of these, as well as some scripts written for Unix systems, will require that you use Unix syntax, since they will assume that '/' is the directory separator, I If you find instances of this in the Perl distribution itself, please let us know, so we can try to work around them.
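One way to keep a script from assuming that '/' is the directory separator is to let the standard File::Spec module assemble and split paths. The following is only a minimal sketch: the directory and file names are hypothetical, and the exact strings produced depend on the system the script runs on.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical directory and file names, for illustration only.
my @dirs = ('lib', 'Foo');
my $file = 'Bar.pm';

# catfile() joins the pieces using the native syntax of the running system,
# so the script never has to hard-code a directory separator.
my $path = File::Spec->catfile(@dirs, $file);

# splitpath() and splitdir() take the path apart again without hand parsing.
my ($volume, $directories, $basename) = File::Spec->splitpath($path);
my @parts = grep { length } File::Spec->splitdir($directories);

print "path: $path\n";
print "file: $basename\n";
print "dirs: @parts\n";
```

Scripts written this way pass native-format specifications to the rest of Perl, which sidesteps the mixed-syntax restriction described above.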
Also when working on Perl programs on VMS, if you need a syntax in a specific operating system format, then you need either to check the appropriate DECC$ feature logical, or call a conversion routine to force it to that format. The feature logical name DECC$FILENAME_UNIX_REPORT modifies traditional Perl behavior in the conversion of file specifications from Unix to VMS format in order to follow the extended character handling rules now expected by the CRTL. Specifically, when this feature is in effect, the C<./.../> in a Unix path is now translated to C<[.^.^.^.]> instead of the traditional VMS C<[...]>. To be compatible with what MakeMaker expects, if a VMS path cannot be translated to a Unix path, it is passed through unchanged, so C will return C<[...]>. The handling of extended characters is largely complete in the VMS-specific C infrastructure of Perl, but more work is still needed to fully support extended syntax filenames in several core modules. In particular, at this writing PathTools has only partial support for directories containing some extended characters. There are several ambiguous cases where a conversion routine cannot determine whether an input filename is in Unix format or in VMS format, since now both VMS and Unix file specifications may have characters in them that could be mistaken for syntax delimiters of the other type. So some pathnames simply cannot be used in a mode that allows either type of pathname to be present. Perl will tend to assume that an ambiguous filename is in Unix format. Allowing "." as a version delimiter is simply incompatible with determining whether a pathname is in VMS format or in Unix format with extended file syntax. There is no way to know whether "perl-5.8.6" is a Unix "perl-5.8.6" or a VMS "perl-5.8;6" when passing it to unixify() or vmsify(). 
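As a rough illustration of checking a DECC$ feature logical before deciding which filename format to expect, the sketch below reads the value through C<%ENV>. It rests on two assumptions that are not stated in this section: that the logical name is visible through C<%ENV>, and that a value beginning with "E", "T", or "1" (as in "ENABLE") means the feature is on. The helper name C<feature_enabled> is hypothetical.

```perl
use strict;
use warnings;

# Return 1 if the named feature logical looks enabled, 0 otherwise.
# Assumption: a value starting with E, T, or 1 (e.g. "ENABLE") means enabled.
sub feature_enabled {
    my ($name) = @_;
    my $value = $ENV{$name};
    return 0 unless defined $value && length $value;
    return $value =~ /^[ET1]/i ? 1 : 0;
}

# Single quotes keep the '$' in the logical name from being interpolated.
my $feature = 'DECC$FILENAME_UNIX_REPORT';

if (feature_enabled($feature)) {
    print "$feature is enabled: expect Unix-format file specifications\n";
}
else {
    print "$feature is not enabled: expect VMS-format file specifications\n";
}
```

When the answer matters for correctness, calling an explicit conversion routine is the safer of the two options named above, since it does not depend on how the process happens to be configured.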
The DECC$FILENAME_UNIX_REPORT logical name controls how Perl interprets filenames to the extent that Perl uses the CRTL internally for many purposes, and attempts to follow CRTL conventions for reporting filenames. The DECC$FILENAME_UNIX_ONLY feature differs in that it expects all filenames passed to the C run-time to be already in Unix format. This feature is not yet supported in Perl since Perl uses traditional OpenVMS file specifications internally and in the test harness, and it is not yet clear whether this mode will be useful or useable. The feature logical name DECC$POSIX_COMPLIANT_PATHNAMES is new with the RMS Symbolic Link SDK and included with OpenVMS v8.3, but is not yet supported in Perl.

=head2 Filename Case

Perl follows VMS defaults and override settings in preserving (or not preserving) filename case. Case is not preserved on ODS-2 formatted volumes on any architecture. On ODS-5 volumes, filenames may be case preserved depending on process and feature settings. Perl now honors DECC$EFS_CASE_PRESERVE and DECC$ARGV_PARSE_STYLE on those systems where the CRTL supports these features. When these features are not enabled or the CRTL does not support them, Perl follows the traditional CRTL behavior of downcasing command-line arguments and returning file specifications in lower case only.

I It is very easy to get tripped up using a mixture of other programs, external utilities, and Perl scripts that are in varying states of being able to handle case preservation. For example, a file created by an older version of an archive utility or a build utility such as MMK or MMS may generate a filename in all upper case even on an ODS-5 volume. If this filename is later retrieved by a Perl script or module in a case preserving environment, that upper case name may not match the mixed-case or lower-case expectations of the Perl code.
Your best bet is to follow an all-or-nothing approach to case preservation: either don't use it at all, or make sure your entire toolchain and application environment support and use it.

OpenVMS Alpha v7.3-1 and later and all versions of OpenVMS I64 support case sensitivity as a process setting (see C). Perl does not currently support case sensitivity on VMS, but it may in the future, so Perl programs should use the C<< File::Spec->case_tolerant >> method to determine the state, and not the C<$^O> variable.

=head2 Symbolic Links

When built on an ODS-5 volume with symbolic links enabled, Perl by default supports symbolic links when the requisite support is available in the filesystem and CRTL (generally 64-bit OpenVMS v8.3 and later). There are a number of limitations and caveats to be aware of when working with symbolic links on VMS. Most notably, the target of a valid symbolic link must be expressed as a Unix-style path and it must exist on a volume visible from your POSIX root (see the C command in DCL help). For further details on symbolic link capabilities and requirements, see chapter 12 of the CRTL manual that ships with OpenVMS v8.3 or later.

=head2 Wildcard expansion

File specifications containing wildcards are allowed both on the command line and within Perl globs (e.g. C<< <*.c> >>). If the wildcard filespec uses VMS syntax, the resultant filespecs will follow VMS syntax; if a Unix-style filespec is passed in, Unix-style filespecs will be returned. Similar to the behavior of wildcard globbing for a Unix shell, one can escape command line wildcards with double quotation marks C<"> around a perl program command line argument.
However, owing to the stripping of C<"> characters carried out by the C handling of argv you will need to escape a construct such as this one (in a directory containing the files F, F, F, and F): $ perl -e "print join(' ',@ARGV)" perl.* perl.c perl.exe perl.h perl.obj in the following triple quoted manner: $ perl -e "print join(' ',@ARGV)" """perl.*""" perl.* In both the case of unquoted command line arguments or in calls to C VMS wildcard expansion is performed. (csh-style wildcard expansion is available if you use C.) If the wildcard filespec contains a device or directory specification, then the resultant filespecs will also contain a device and directory; otherwise, device and directory information are removed. VMS-style resultant filespecs will contain a full device and directory, while Unix-style resultant filespecs will contain only as much of a directory path as was present in the input filespec. For example, if your default directory is Perl_Root:[000000], the expansion of C<[.t]*.*> will yield filespecs like "perl_root:[t]base.dir", while the expansion of C will yield filespecs like "t/base.dir". (This is done to match the behavior of glob expansion performed by Unix shells.) Similarly, the resultant filespec will contain the file version only if one was present in the input filespec. =head2 Pipes Input and output pipes to Perl filehandles are supported; the "file name" is passed to lib$spawn() for asynchronous execution. You should be careful to close any pipes you have opened in a Perl script, lest you leave any "orphaned" subprocesses around when Perl exits. You may also use backticks to invoke a DCL subprocess, whose output is used as the return value of the expression. The string between the backticks is handled as if it were the argument to the C operator (see below). In this case, Perl will wait for the subprocess to complete before continuing. 
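The two ways of reading a subprocess's output described above can be sketched as follows. The command is hypothetical: it simply runs a second copy of the interpreter found in C<$^X>, and the double-quote quoting is an assumption that may need adjusting for a particular command or system.

```perl
use strict;
use warnings;

# Hypothetical command: run a second copy of Perl as the subprocess.
# $^X holds the name used to start the running interpreter.
my $command = $^X . q{ -e "print 6*7"};

# Input pipe: the subprocess runs asynchronously and we read its output.
open(my $pipe, "$command |") or die "Cannot start subprocess: $!";
my $from_pipe = <$pipe>;

# Close the pipe explicitly so no orphaned subprocess is left behind.
close($pipe) or warn "Subprocess reported a failure: $?";

# Backticks run the same command, wait for it to finish, and return its output.
my $from_backticks = `$command`;

print "pipe: $from_pipe\n";
print "backticks: $from_backticks\n";
```

The piped form suits long-running commands whose output is consumed incrementally; backticks are simpler when the whole output is wanted at once.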
The mailbox (MBX) that perl can create to communicate with a pipe defaults to a buffer size of 8192 on 64-bit systems, 512 on VAX. The default buffer size is adjustable via the logical name PERL_MBX_SIZE provided that the value falls between 128 and the SYSGEN parameter MAXBUF inclusive. For example, to set the mailbox size to 32767 use C<$ENV{'PERL_MBX_SIZE'} = 32767;> and then open and use pipe constructs. An alternative would be to issue the command: $ Define PERL_MBX_SIZE 32767 before running your wide record pipe program. A larger value may improve performance at the expense of the BYTLM UAF quota. =head1 PERL5LIB and PERLLIB The PERL5LIB and PERLLIB logical names work as documented in L, except that the element separator is '|' instead of ':'. The directory specifications may use either VMS or Unix syntax. =head1 The Perl Forked Debugger The Perl forked debugger places the debugger commands and output in a separate X-11 terminal window so that commands and output from multiple processes are not mixed together. Perl on VMS supports an emulation of the forked debugger when Perl is run on a VMS system that has X11 support installed. To use the forked debugger, you need to have the default display set to an X-11 Server and some environment variables set that Unix expects. The forked debugger requires the environment variable C to be C, and the environment variable C to exist. C must be in lower case. $define TERM "xterm" $define DISPLAY "hostname:0.0" Currently the value of C is ignored. It is recommended that it be set to be the hostname of the display, the server and screen in Unix notation. In the future the value of DISPLAY may be honored by Perl instead of using the default display. It may be helpful to always use the forked debugger so that script I/O is separated from debugger I/O. You can force the debugger to be forked by assigning a value to the logical name that is not a process identification number. 
    $define PERLDB_PIDS XXXX

=head1 PERL_VMS_EXCEPTION_DEBUG

Defining the logical name PERL_VMS_EXCEPTION_DEBUG as "ENABLE" causes the VMS debugger to be invoked if a fatal exception that is not otherwise handled is raised. The purpose of this is to allow debugging of internal Perl problems that would cause such a condition.

This allows the programmer to look at the execution stack and variables to find out the cause of the exception. As the debugger is being invoked as the Perl interpreter is about to do a fatal exit, continuing the execution in debug mode is usually not practical.

Starting Perl in the VMS debugger may change the program execution profile in a way that such problems are not reproduced.

The C function can be used to test this functionality from within a program.

In typical VMS style, only the first letter of the value of this logical name is actually checked in a case insensitive mode, and it is considered enabled if it is the value "T", "1" or "E".

This logical name must be defined before Perl is started.

=head1 Command line

=head2 I/O redirection and backgrounding

Perl for VMS supports redirection of input and output on the command line, using a subset of Bourne shell syntax:

=over 4

=item *

C<< <file >> reads stdin from C<file>,

=item *

C<< >file >> writes stdout to C<file>,

=item *

C<<< >>file >>> appends stdout to C<file>,

=item *

C<< 2>file >> writes stderr to C<file>,

=item *

C<<< 2>>file >>> appends stderr to C<file>, and

=item *

C<< 2>&1 >> redirects stderr to stdout.

=back

In addition, output may be piped to a subprocess, using the character '|'. Anything after this character on the command line is passed to a subprocess for execution; the subprocess takes the output of Perl as its input.

Finally, if the command line ends with '&', the entire command is run in the background as an asynchronous subprocess.

=head2 Command line switches

The following command line switches behave differently under VMS than described in L.
Note also that in order to pass uppercase switches to Perl, you need to enclose them in double-quotes on the command line, since the CRTL downcases all unquoted strings. On newer 64 bit versions of OpenVMS, a process setting now controls if the quoting is needed to preserve the case of command line arguments. =over 4 =item -i If the C<-i> switch is present but no extension for a backup copy is given, then inplace editing creates a new version of a file; the existing copy is not deleted. (Note that if an extension is given, an existing file is renamed to the backup file, as is the case under other operating systems, so it does not remain as a previous version under the original filename.) =item -S If the C<"-S"> or C<-"S"> switch is present I the script name does not contain a directory, then Perl translates the logical name DCL$PATH as a searchlist, using each translation as a directory in which to look for the script. In addition, if no file type is specified, Perl looks in each directory for a file matching the name specified, with a blank type, a type of F<.pl>, and a type of F<.com>, in that order. =item -u The C<-u> switch causes the VMS debugger to be invoked after the Perl program is compiled, but before it has run. It does not create a core dump file. 
=back

=head1 Perl functions

As of the time this document was last revised, the following Perl functions were implemented in the VMS port of Perl (functions marked with * are discussed in more detail below):

    file tests*, abs, alarm, atan, backticks*, binmode*, bless,
    caller, chdir, chmod, chown, chomp, chop, chr,
    close, closedir, cos, crypt*, defined, delete, die, do, dump*,
    each, endgrent, endpwent, eof, eval, exec*, exists, exit, exp,
    fileno, flock, getc, getgrent*, getgrgid*, getgrnam, getlogin,
    getppid, getpwent*, getpwnam*, getpwuid*, glob, gmtime*, goto,
    grep, hex, ioctl, import, index, int, join, keys, kill*,
    last, lc, lcfirst, lchown*, length, link*, local, localtime, log,
    lstat, m//, map, mkdir, my, next, no, oct, open, opendir, ord,
    pack, pipe, pop, pos, print, printf, push, q//, qq//, qw//,
    qx//*, quotemeta, rand, read, readdir, readlink*, redo, ref,
    rename, require, reset, return, reverse, rewinddir, rindex,
    rmdir, s///, scalar, seek, seekdir, select(internal),
    select (system call)*, setgrent, setpwent, shift, sin, sleep,
    socketpair, sort, splice, split, sprintf, sqrt, srand, stat,
    study, substr, symlink*, sysread, system*, syswrite, tell,
    telldir, tie, time, times*, tr///, uc, ucfirst, umask,
    undef, unlink*, unpack, untie, unshift, use, utime*,
    values, vec, wait, waitpid*, wantarray, warn, write, y///

The following functions were not implemented in the VMS port, and calling them produces a fatal error (usually) or undefined behavior (rarely, we hope):

    chroot, dbmclose, dbmopen, fork*, getpgrp, getpriority,
    msgctl, msgget, msgsend, msgrcv, semctl,
    semget, semop, setpgrp, setpriority, shmctl, shmget,
    shmread, shmwrite, syscall

The following functions are available on Perls compiled with Dec C 5.2 or greater and running VMS 7.0 or greater:

    truncate

The following functions are available on Perls built on VMS 7.2 or greater:

    fcntl (without locking)

The following functions may or may not be implemented, depending on what type of socket support you've built
into your copy of Perl: accept, bind, connect, getpeername, gethostbyname, getnetbyname, getprotobyname, getservbyname, gethostbyaddr, getnetbyaddr, getprotobynumber, getservbyport, gethostent, getnetent, getprotoent, getservent, sethostent, setnetent, setprotoent, setservent, endhostent, endnetent, endprotoent, endservent, getsockname, getsockopt, listen, recv, select(system call)*, send, setsockopt, shutdown, socket The following function is available on Perls built on 64 bit OpenVMS v8.2 with hard links enabled on an ODS-5 formatted build disk. CRTL support is in principle available as of OpenVMS v7.3-1, and better configuration support could detect this. link The following functions are available on Perls built on 64 bit OpenVMS v8.2 and later. CRTL support is in principle available as of OpenVMS v7.3-2, and better configuration support could detect this. getgrgid, getgrnam, getpwnam, getpwuid, setgrent, ttyname The following functions are available on Perls built on 64 bit OpenVMS v8.2 and later. statvfs, socketpair =over 4 =item File tests The tests C<-b>, C<-B>, C<-c>, C<-C>, C<-d>, C<-e>, C<-f>, C<-o>, C<-M>, C<-s>, C<-S>, C<-t>, C<-T>, and C<-z> work as advertised. The return values for C<-r>, C<-w>, and C<-x> tell you whether you can actually access the file; this may not reflect the UIC-based file protections. Since real and effective UIC don't differ under VMS, C<-O>, C<-R>, C<-W>, and C<-X> are equivalent to C<-o>, C<-r>, C<-w>, and C<-x>. Similarly, several other tests, including C<-A>, C<-g>, C<-k>, C<-l>, C<-p>, and C<-u>, aren't particularly meaningful under VMS, and the values returned by these tests reflect whatever your CRTL C routine does to the equivalent bits in the st_mode field. Finally, C<-d> returns true if passed a device specification without an explicit directory (e.g. C), as well as if passed a directory. There are DECC feature logical names AND ODS-5 volume attributes that also control what values are returned for the date fields. 
Note: Some sites have reported problems when using the file-access tests (C<-r>, C<-w>, and C<-x>) on files accessed via DEC's DFS. Specifically, since DFS does not currently provide access to the extended file header of files on remote volumes, attempts to examine the ACL fail, and the file tests will return false, with C<$!> indicating that the file does not exist. You can use C on these files, since that checks UIC-based protection only, and then manually check the appropriate bits, as defined by your C compiler's F, in the mode value it returns, if you need an approximation of the file's protections. =item backticks Backticks create a subprocess, and pass the enclosed string to it for execution as a DCL command. Since the subprocess is created directly via C, any valid DCL command string may be specified. =item binmode FILEHANDLE The C operator will attempt to insure that no translation of carriage control occurs on input from or output to this filehandle. Since this involves reopening the file and then restoring its file position indicator, if this function returns FALSE, the underlying filehandle may no longer point to an open file, or may point to a different position in the file than before C was called. Note that C is generally not necessary when using normal filehandles; it is provided so that you can control I/O to existing record-structured files when necessary. You can also use the C function in the VMS::Stdio extension to gain finer control of I/O to files and devices with different record structures. =item crypt PLAINTEXT, USER The C operator uses the C system service to generate the hashed representation of PLAINTEXT. If USER is a valid username, the algorithm and salt values are taken from that user's UAF record. If it is not, then the preferred algorithm and a salt of 0 are used. The quadword encrypted value is returned as an 8-character string. 
The value returned by C may be compared against the encrypted password from the UAF returned by the C functions, in order to authenticate users. If you're going to do this, remember that the encrypted password in the UAF was generated using uppercase username and password strings; you'll have to upcase the arguments to C to ensure that you'll get the proper value:

    sub validate_passwd {
        my($user,$passwd) = @_;
        my($pwdhash);
        if ( !($pwdhash = (getpwnam($user))[1]) ||
               $pwdhash ne crypt("\U$passwd","\U$user") ) {
            intruder_alert($user);
        }
        return 1;
    }

=item die

C will force the native VMS exit status to be an SS$_ABORT code if neither of the $! or $? status values are ones that would cause the native status to be interpreted as being what VMS classifies as SEVERE_ERROR severity for DCL error handling.

When C is active (see L below), the native VMS exit status value will have either one of the C<$!> or C<$?> or C<$^E> or the Unix value 255 encoded into it in a way that the effective original value can be decoded by other programs written in C, including Perl and the GNV package. As per the normal non-VMS behavior of C, if either C<$!> or C<$?> is non-zero, one of those values will be encoded into a native VMS status value. If both of the Unix status values are 0, and the C<$^E> value has a severity of ERROR or SEVERE_ERROR, then the C<$^E> value will be used as the exit code as is. If none of the above apply, the Unix value of 255 will be encoded into a native VMS exit status value.

Please note a significant difference in the behavior of C in the C mode is that it does not force a VMS SEVERE_ERROR status on exit. The Unix exit values of 2 through 255 will be encoded in VMS status values with severity levels of SUCCESS. The Unix exit value of 1 will be encoded in a VMS status value with a severity level of ERROR. This is to be compatible with how the VMS C library encodes these values.
The minimum severity level set by C in C mode may be changed to be ERROR or higher in the future depending on the results of testing and further review. See L for a description of the encoding of the Unix value to produce a native VMS status containing it. =item dump Rather than causing Perl to abort and dump core, the C operator invokes the VMS debugger. If you continue to execute the Perl program under the debugger, control will be transferred to the label specified as the argument to C, or, if no label was specified, back to the beginning of the program. All other state of the program (I values of variables, open file handles) are not affected by calling C. =item exec LIST A call to C will cause Perl to exit, and to invoke the command given as an argument to C via C. If the argument begins with '@' or '$' (other than as part of a filespec), then it is executed as a DCL command. Otherwise, the first token on the command line is treated as the filespec of an image to run, and an attempt is made to invoke it (using F<.Exe> and the process defaults to expand the filespec) and pass the rest of C's argument to it as parameters. If the token has no file type, and matches a file with null type, then an attempt is made to determine whether the file is an executable image which should be invoked using C or a text file which should be passed to DCL as a command procedure. =item fork While in principle the C operator could be implemented via (and with the same rather severe limitations as) the CRTL C routine, and while some internal support to do just that is in place, the implementation has never been completed, making C currently unavailable. A true kernel C is expected in a future version of VMS, and the pseudo-fork based on interpreter threads may be available in a future version of Perl on VMS (see L). In the meantime, use C, backticks, or piped filehandles to create subprocesses. 
=item getpwent

=item getpwnam

=item getpwuid

These operators obtain the information described in L, if you have the privileges necessary to retrieve the named user's UAF information via C. If not, then only the C<$name>, C<$uid>, and C<$gid> items are returned. The C<$dir> item contains the login directory in VMS syntax, while the C<$comment> item contains the login directory in Unix syntax. The C<$gcos> item contains the owner field from the UAF record. The C<$quota> item is not used.

=item gmtime

The C operator will function properly if you have a working CRTL C routine, or if the logical name SYS$TIMEZONE_DIFFERENTIAL is defined as the number of seconds which must be added to UTC to yield local time. (This logical name is defined automatically if you are running a version of VMS with built-in UTC support.) If neither of these cases is true, a warning message is printed, and C is returned.

=item kill

In most cases, C is implemented via the undocumented system service C<$SIGPRC>, which has the same calling sequence as C<$FORCEX>, but throws an exception in the target process rather than forcing it to call C<$EXIT>. Generally speaking, C follows the behavior of the CRTL's C function, but unlike that function can be called from within a signal handler. Also, unlike the C in some versions of the CRTL, Perl's C checks the validity of the signal passed in and returns an error rather than attempting to send an unrecognized signal.

Also, negative signal values don't do anything special under VMS; they're just converted to the corresponding positive value.

=item qx//

See the entry on C above.

=item select (system call)

If Perl was not built with socket support, the system call version of C is unavailable. If socket support is present, it functions only for file descriptors attached to sockets. It will not provide information about regular files or pipes, since the CRTL C routine does not provide this functionality.
=item stat EXPR Since VMS keeps track of files according to a different scheme than Unix, it's not really possible to represent the file's ID in the C and C fields of a C. Perl tries its best, though, and the values it uses are pretty unlikely to be the same for two different files. We can't guarantee this, though, so caveat scriptor. =item system LIST The C operator creates a subprocess, and passes its arguments to the subprocess for execution as a DCL command. Since the subprocess is created directly via C, any valid DCL command string may be specified. If the string begins with '@', it is treated as a DCL command unconditionally. Otherwise, if the first token contains a character used as a delimiter in file specification (e.g. C<:> or C<]>), an attempt is made to expand it using a default type of F<.Exe> and the process defaults, and if successful, the resulting file is invoked via C. This allows you to invoke an image directly simply by passing the file specification to C, a common Unixish idiom. If the token has no file type, and matches a file with null type, then an attempt is made to determine whether the file is an executable image which should be invoked using C or a text file which should be passed to DCL as a command procedure. If LIST consists of the empty string, C spawns an interactive DCL subprocess, in the same fashion as typing B at the DCL prompt. Perl waits for the subprocess to complete before continuing execution in the current process. As described in L, the return value of C is a fake "status" which follows POSIX semantics unless the pragma C is in effect; see the description of C<$?> in this document for more detail. =item time The value returned by C

. We knew exactly what code would be run, and assumed we could avoid variable collisions and outrageous symbol table growth.

Let's say your application is a server that will occasionally run Perl code from some arbitrary file. Your server has no way of knowing what code it's going to run. Very dangerous.

If the file is pulled in by C, compiled into a newly constructed interpreter, and subsequently cleaned out with C afterwards, you're shielded from most namespace troubles.

One way to avoid namespace collisions in this scenario is to translate the filename into a guaranteed-unique package name, and then compile the code into that package using L. In the example below, each file will only be compiled once. Or, the application might choose to clean out the symbol table associated with the file after it's no longer needed. Using L, we'll call the subroutine C which lives in the file C and pass the filename and boolean cleanup/cache flag as arguments.

Note that the process will continue to grow for each file that it uses. In addition, there might be Ced subroutines and other conditions that cause Perl's symbol table to grow. You might want to add some logic that keeps track of the process size, or restarts itself after a certain number of requests, to ensure that memory consumption is minimized. You'll also want to scope your variables with L whenever possible.

    package Embed::Persistent;
    #persistent.pl

    use strict;
    our %Cache;
    use Symbol qw(delete_package);

    sub valid_package_name {
        my($string) = @_;
        $string =~ s/([^A-Za-z0-9\/])/sprintf("_%2x",unpack("C",$1))/eg;
        # second pass only for words starting with a digit
        $string =~ s|/(\d)|sprintf("/_%2x",unpack("C",$1))|eg;

        # Dress it up as a real package name
        $string =~ s|/|::|g;
        return "Embed" . $string;
    }

    sub eval_file {
        my($filename, $delete) = @_;
        my $package = valid_package_name($filename);
        my $mtime = -M $filename;
        if(defined $Cache{$package}{mtime}
           &&
           $Cache{$package}{mtime} <= $mtime)
        {
            # we have compiled this subroutine already,
            # it has not been updated on disk, nothing left to do
            print STDERR "already compiled $package->handler\n";
        }
        else {
            local *FH;
            open FH, $filename or die "open '$filename' $!";
            local($/) = undef;
            my $sub = <FH>;
            close FH;

            #wrap the code into a subroutine inside our unique package
            my $eval = qq{package $package; sub handler { $sub; }};
            {
                # hide our variables within this block
                my($filename,$mtime,$package,$sub);
                eval $eval;
            }
            die $@ if $@;

            #cache it unless we're cleaning out each time
            $Cache{$package}{mtime} = $mtime unless $delete;
        }

        eval {$package->handler;};
        die $@ if $@;

        delete_package($package) if $delete;

        #take a look if you want
        #print Devel::Symdump->rnew($package)->as_string, $/;
    }

    1;

    __END__

    /* persistent.c */
    #include <EXTERN.h>
    #include <perl.h>

    /* 1 = clean out filename's symbol table after each request,
       0 = don't
    */
    #ifndef DO_CLEAN
    #define DO_CLEAN 0
    #endif

    #define BUFFER_SIZE 1024

    static PerlInterpreter *my_perl = NULL;

    int
    main(int argc, char **argv, char **env)
    {
        char *embedding[] = { "", "persistent.pl" };
        char *args[] = { "", DO_CLEAN, NULL };
        char filename[BUFFER_SIZE];
        int exitstatus = 0;

        PERL_SYS_INIT3(&argc,&argv,&env);
        if((my_perl = perl_alloc()) == NULL) {
            fprintf(stderr, "no memory!");
            exit(1);
        }
        perl_construct(my_perl);

        PL_origalen = 1; /* don't let $0 assignment update the
                            proctitle or embedding[0] */
        exitstatus = perl_parse(my_perl, NULL, 2, embedding, NULL);
        PL_exit_flags |= PERL_EXIT_DESTRUCT_END;
        if(!exitstatus) {
            exitstatus = perl_run(my_perl);

            while(printf("Enter file name: ") &&
                  fgets(filename, BUFFER_SIZE, stdin)) {

                filename[strlen(filename)-1] = '\0'; /* strip \n */
                /* call the subroutine,
                   passing it the filename as an argument */
                args[0] = filename;
                call_argv("Embed::Persistent::eval_file",
                          G_DISCARD | G_EVAL, args);

                /* check $@ */
                if(SvTRUE(ERRSV))
                    fprintf(stderr, "eval error: %s\n", SvPV_nolen(ERRSV));
            }
        }

        PL_perl_destruct_level = 0;
        perl_destruct(my_perl);
        perl_free(my_perl);
        PERL_SYS_TERM();
        exit(exitstatus);
    }

Now compile:

    % cc -o persistent persistent.c `perl -MExtUtils::Embed -e ccopts -e ldopts`

Here's an example script file:

    #test.pl
    my $string = "hello";
    foo($string);

    sub foo {
        print "foo says: @_\n";
    }

Now run:

    % persistent
    Enter file name: test.pl
    foo says: hello
    Enter file name: test.pl
    already compiled Embed::test_2epl->handler
    foo says: hello
    Enter file name: ^C

=head2 Execution of END blocks

Traditionally END blocks have been executed at the end of the perl_run. This causes problems for applications that never call perl_run. Since perl 5.7.2 you can specify C to get the new behaviour. This also enables the running of END blocks if the perl_parse fails and C will return the exit value.

=head2 $0 assignments

When a perl script assigns a value to $0 then the perl runtime will try to make this value show up as the program name reported by "ps" by updating the memory pointed to by the argv passed to perl_parse() and also calling API functions like setproctitle() where available. This behaviour might not be appropriate when embedding perl and can be disabled by assigning the value C<1> to the variable C before perl_parse() is called.

The F example above is for instance likely to segfault when $0 is assigned to if the C assignment is removed. This is because perl will try to write to the read only memory of the C strings.

=head2 Maintaining multiple interpreter instances

Some rare applications will need to create more than one interpreter during a session. Such an application might sporadically decide to release any resources associated with the interpreter.

The program must take care to ensure that this takes place I<before> the next interpreter is constructed.
By default, when perl is not built with any special options, the
global variable C<PL_perl_destruct_level> is set to C<0>, since extra
cleaning isn't usually needed when a program only ever creates a single
interpreter in its entire lifetime.

Setting C<PL_perl_destruct_level> to C<1> makes everything squeaky clean:

 while(1) {
     ...
     /* reset global variables here with PL_perl_destruct_level = 1 */
     PL_perl_destruct_level = 1;
     perl_construct(my_perl);
     ...
     /* clean and reset _everything_ during perl_destruct */
     PL_perl_destruct_level = 1;
     perl_destruct(my_perl);
     perl_free(my_perl);
     ...
     /* let's go do it again! */
 }

When I<perl_destruct()> is called, the interpreter's syntax parse tree
and symbol tables are cleaned up, and global variables are reset. The
second assignment to C<PL_perl_destruct_level> is needed because
perl_construct resets it to C<0>.

Now suppose we have more than one interpreter instance running at the
same time. This is feasible, but only if you used the Configure option
C<-Dusemultiplicity> or the options C<-Dusethreads -Duseithreads> when
building perl. By default, enabling one of these Configure options
sets the per-interpreter global variable C<PL_perl_destruct_level> to
C<1>, so that thorough cleaning is automatic and interpreter variables
are initialized correctly. Even if you intend to run two or more
interpreters sequentially rather than at the same time, as in the above
example, it is recommended to build perl with the C<-Dusemultiplicity>
option; otherwise some interpreter variables may not be initialized
correctly between consecutive runs and your application may crash.

See also L<perlxs/Thread-aware system interfaces>.

Using C<-Dusethreads -Duseithreads> rather than C<-Dusemultiplicity>
is more appropriate if you intend to run multiple interpreters
concurrently in different threads, because it enables support for
linking in the thread libraries of your system with the interpreter.

Let's give it a try:

 #include <EXTERN.h>
 #include <perl.h>

 /* we're going to embed two interpreters */

 #define SAY_HELLO "-e", "print qq(Hi, I'm $^X\n)"

 int main(int argc, char **argv, char **env)
 {
     PerlInterpreter *one_perl, *two_perl;
     char *one_args[] = { "one_perl", SAY_HELLO };
     char *two_args[] = { "two_perl", SAY_HELLO };

     PERL_SYS_INIT3(&argc,&argv,&env);
     one_perl = perl_alloc();
     two_perl = perl_alloc();

     PERL_SET_CONTEXT(one_perl);
     perl_construct(one_perl);
     PERL_SET_CONTEXT(two_perl);
     perl_construct(two_perl);

     PERL_SET_CONTEXT(one_perl);
     perl_parse(one_perl, NULL, 3, one_args, (char **)NULL);
     PERL_SET_CONTEXT(two_perl);
     perl_parse(two_perl, NULL, 3, two_args, (char **)NULL);

     PERL_SET_CONTEXT(one_perl);
     perl_run(one_perl);
     PERL_SET_CONTEXT(two_perl);
     perl_run(two_perl);

     PERL_SET_CONTEXT(one_perl);
     perl_destruct(one_perl);
     PERL_SET_CONTEXT(two_perl);
     perl_destruct(two_perl);

     PERL_SET_CONTEXT(one_perl);
     perl_free(one_perl);
     PERL_SET_CONTEXT(two_perl);
     perl_free(two_perl);
     PERL_SYS_TERM();
 }

Note the calls to PERL_SET_CONTEXT(). These are necessary to initialize
the global state that tracks which interpreter is the "current" one on
the particular process or thread that may be running it. It should
always be used if you have more than one interpreter and are making
perl API calls on both interpreters in an interleaved fashion.

PERL_SET_CONTEXT(interp) should also be called whenever C<interp> is
used by a thread that did not create it (using either perl_alloc(), or
the more esoteric perl_clone()).

Compile as usual:

 % cc -o multiplicity multiplicity.c `perl -MExtUtils::Embed -e ccopts -e ldopts`

Run it:

 % multiplicity
 Hi, I'm one_perl
 Hi, I'm two_perl

=head2 Using Perl modules, which themselves use C libraries, from your C program

If you've played with the examples above and tried to embed a script
that I<use()>s a Perl module (such as I<Socket>) which itself uses a C
or C++ library, this probably happened:

 Can't load module Socket, dynamic loading not available in this perl.
  (You may need to build a new perl executable which either supports
  dynamic loading or has the Socket module statically linked into it.)

What's wrong?

Your interpreter doesn't know how to communicate with these extensions
on its own. A little glue will help. Up until now you've been
calling I<perl_parse()>, handing it NULL for the second argument:

 perl_parse(my_perl, NULL, argc, my_argv, NULL);

That's where the glue code can be inserted to create the initial
contact between Perl and linked C/C++ routines. Let's take a look at
some pieces of I<perlmain.c> to see how Perl does this:

 static void xs_init (pTHX);

 EXTERN_C void boot_DynaLoader (pTHX_ CV* cv);
 EXTERN_C void boot_Socket (pTHX_ CV* cv);

 EXTERN_C void
 xs_init(pTHX)
 {
     char *file = __FILE__;
     /* DynaLoader is a special case */
     newXS("DynaLoader::boot_DynaLoader", boot_DynaLoader, file);
     newXS("Socket::bootstrap", boot_Socket, file);
 }

Simply put: for each extension linked with your Perl executable
(determined during its initial configuration on your computer or when
adding a new extension), a Perl subroutine is created to incorporate
the extension's routines. Normally, that subroutine is named
I<Module::bootstrap()> and is invoked when you say I<use Module>. In
turn, this hooks into an XSUB, I<boot_Module>, which creates a Perl
counterpart for each of the extension's XSUBs. Don't worry about this
part; leave that to the I<xsubpp> and extension authors. If your
extension is dynamically loaded, DynaLoader creates
I<Module::bootstrap()> for you on the fly. In fact, if you have a
working DynaLoader then there is rarely any need to link in any other
extensions statically.

Once you have this code, slap it into the second argument of
I<perl_parse()>:

 perl_parse(my_perl, xs_init, argc, my_argv, NULL);

Then compile:

 % cc -o interp interp.c `perl -MExtUtils::Embed -e ccopts -e ldopts`

 % interp
   use Socket;
   use SomeDynamicallyLoadedModule;

   print "Now I can use extensions!\n"

B<ExtUtils::Embed> can also automate writing the I<xs_init> glue code.

 % perl -MExtUtils::Embed -e xsinit -- -o perlxsi.c
 % cc -c perlxsi.c `perl -MExtUtils::Embed -e ccopts`
 % cc -c interp.c  `perl -MExtUtils::Embed -e ccopts`
 % cc -o interp perlxsi.o interp.o `perl -MExtUtils::Embed -e ldopts`

Consult L<perlxs>, L<perlguts>, and L<perlapi> for more details.

=head1 Hiding Perl_

To completely hide the short forms of the Perl public API,
add -DPERL_NO_SHORT_NAMES to the compilation flags. This means that,
for example, instead of writing

    warn("%d bottles of beer on the wall", bottlecount);

you will have to write the explicit full form

    Perl_warn(aTHX_ "%d bottles of beer on the wall", bottlecount);

(See L<perlguts/"Background and PERL_IMPLICIT_CONTEXT"> for the
explanation of the C<aTHX_>.) Hiding the short forms is very useful for
avoiding all sorts of nasty (C preprocessor or otherwise) conflicts with
other software packages (Perl defines about 2400 APIs with these short
names, give or take a few hundred, so there certainly is room for
conflict).

=head1 MORAL

You can sometimes I<write faster code> in C, but
you can always I<write code faster> in Perl. Because you can use
each from the other, combine them as you wish.

=head1 AUTHOR

Jon Orwant and Doug MacEachern, with small contributions from Tim
Bunce, Tom Christiansen, Guy Decoux, Hallvard Furuseth, Dov Grobgeld,
and Ilya Zakharevich.

Doug MacEachern has an article on embedding in Volume 1, Issue 4 of
The Perl Journal ( http://www.tpj.com/ ). Doug is also the developer of
the most widely-used Perl embedding: the mod_perl system
(perl.apache.org), which embeds Perl in the Apache web server.
Oracle, Binary Evolution, ActiveState, and Ben Sugars's nsapi_perl
have used this model for Oracle, Netscape and Internet Information
Server Perl plugins.

=head1 COPYRIGHT

Copyright (C) 1995, 1996, 1997, 1998 Doug MacEachern and Jon Orwant. All
Rights Reserved.

This document may be distributed under the same terms as Perl itself.

If you read this file _as_is_, just ignore the funny characters you see.
It is written in the POD format (see perlpod manpage) which is specially designed to be readable as is. =head1 NAME perlos2 - Perl under OS/2, DOS, Win0.3*, Win0.95 and WinNT. =head1 SYNOPSIS One can read this document in the following formats: man perlos2 view perl perlos2 explorer perlos2.html info perlos2 to list some (not all may be available simultaneously), or it may be read I: either as F, or F. To read the F<.INF> version of documentation (B recommended) outside of OS/2, one needs an IBM's reader (may be available on IBM ftp sites (?) (URL anyone?)) or shipped with PC DOS 7.0 and IBM's Visual Age C++ 3.5. A copy of a Win* viewer is contained in the "Just add OS/2 Warp" package ftp://ftp.software.ibm.com/ps/products/os2/tools/jaow/jaow.zip in F. This gives one an access to EMX's F<.INF> docs as well (text form is available in F in EMX's distribution). There is also a different viewer named xview. Note that if you have F or F installed, you can follow WWW links from this document in F<.INF> format. If you have EMX docs installed correctly, you can follow library links (you need to have C working by setting C environment variable as it is described in EMX docs). =cut Contents (This may be a little bit obsolete) perlos2 - Perl under OS/2, DOS, Win0.3*, Win0.95 and WinNT. NAME SYNOPSIS DESCRIPTION - Target - Other OSes - Prerequisites - Starting Perl programs under OS/2 (and DOS and...) - Starting OS/2 (and DOS) programs under Perl Frequently asked questions - "It does not work" - I cannot run external programs - I cannot embed perl into my program, or use perl.dll from my - `` and pipe-open do not work under DOS. 
- Cannot start find.exe "pattern" file INSTALLATION - Automatic binary installation - Manual binary installation - Warning Accessing documentation - OS/2 .INF file - Plain text - Manpages - HTML - GNU info files - PDF files - LaTeX docs BUILD - The short story - Prerequisites - Getting perl source - Application of the patches - Hand-editing - Making - Testing - Installing the built perl - a.out-style build Build FAQ - Some / became \ in pdksh. - 'errno' - unresolved external - Problems with tr or sed - Some problem (forget which ;-) - Library ... not found - Segfault in make - op/sprintf test failure Specific (mis)features of OS/2 port - setpriority, getpriority - system() - extproc on the first line - Additional modules: - Prebuilt methods: - Prebuilt variables: - Misfeatures - Modifications - Identifying DLLs - Centralized management of resources Perl flavors - perl.exe - perl_.exe - perl__.exe - perl___.exe - Why strange names? - Why dynamic linking? - Why chimera build? ENVIRONMENT - PERLLIB_PREFIX - PERL_BADLANG - PERL_BADFREE - PERL_SH_DIR - USE_PERL_FLOCK - TMP or TEMP Evolution - Text-mode filehandles - Priorities - DLL name mangling: pre 5.6.2 - DLL name mangling: 5.6.2 and beyond - DLL forwarder generation - Threading - Calls to external programs - Memory allocation - Threads BUGS AUTHOR SEE ALSO =head1 DESCRIPTION =head2 Target The target is to make OS/2 one of the best supported platform for using/building/developing Perl and I, as well as make Perl the best language to use under OS/2. The secondary target is to try to make this work under DOS and Win* as well (but not B hard). The current state is quite close to this target. Known limitations: =over 5 =item * Some *nix programs use fork() a lot; with the mostly useful flavors of perl for OS/2 (there are several built simultaneously) this is supported; but some flavors do not support this (e.g., when Perl is called from inside REXX). 
Using fork() after Iing dynamically loading extensions would not work with I old versions of EMX. =item * You need a separate perl executable F (see L) if you want to use PM code in your application (as Perl/Tk or OpenGL Perl modules do) without having a text-mode window present. While using the standard F from a text-mode window is possible too, I have seen cases when this causes degradation of the system stability. Using F avoids such a degradation. =item * There is no simple way to access WPS objects. The only way I know is via C and C extensions (see L, L). However, we do not have access to convenience methods of Object-REXX. (Is it possible at all? I know of no Object-REXX API.) The C extension (currently in alpha-text) may eventually remove this shortcoming; however, due to the fact that DII is not supported by the C module, using C is not as convenient as one would like it. =back Please keep this list up-to-date by informing me about other items. =head2 Other OSes Since OS/2 port of perl uses a remarkable EMX environment, it can run (and build extensions, and - possibly - be built itself) under any environment which can run EMX. The current list is DOS, DOS-inside-OS/2, Win0.3*, Win0.95 and WinNT. Out of many perl flavors, only one works, see L<"perl_.exe">. Note that not all features of Perl are available under these environments. This depends on the features the I - most probably RSX - decided to implement. Cf. L. =head2 Prerequisites =over 6 =item EMX EMX runtime is required (may be substituted by RSX). Note that it is possible to make F to run under DOS without any external support by binding F/F to it, see C. Note that under DOS for best results one should use RSX runtime, which has much more functions working (like C, C and so on). In fact RSX is required if there is no VCPI present. Note the RSX requires DPMI. Many implementations of DPMI are known to be very buggy, beware! Only the latest runtime is supported, currently C<0.9d fix 03>. 
Perl may run under earlier versions of EMX, but this is not tested.

One can get different parts of EMX from, say

  ftp://crydee.sai.msu.ru/pub/comp/os/os2/leo/gnu/emx+gcc/
  http://hobbes.nmsu.edu/h-browse.php?dir=/pub/os2/dev/emx/v0.9d/

The runtime component should have the name F.

B. When using F/F, it is enough to have them on your path. One
does not need to specify them explicitly (though this

  emx perl_.exe -de 0

will work as well.)

=item RSX

To run Perl on DPMI platforms one needs the RSX runtime. This is needed
under DOS-inside-OS/2, Win0.3*, Win0.95 and WinNT (see
L<"Other OSes">). Unlike EMX, RSX does not work with VCPI alone; it
requires DPMI.

Having RSX and the latest F one gets a fully functional B<*nix>-ish
environment under DOS, say, C, C<``> and pipe-C work. In fact,
MakeMaker works (for static build), so one can have a Perl development
environment under DOS.

One can get RSX from, say

  http://cd.textfiles.com/hobbesos29804/disk1/EMX09C/
  ftp://crydee.sai.msu.ru/pub/comp/os/os2/leo/gnu/emx+gcc/contrib/

Contact the author on C.

The latest F with DOS hooks is available in

  http://www.ilyaz.org/software/os2/

as F or under similar names starting with C, C etc.

=item HPFS

Perl does not care about file systems, but the perl library contains
many files with long names, so to install it intact one needs a file
system which supports long file names.

Note that if you do not plan to build perl itself, it may be possible
to fool EMX into truncating file names. This is not supported; read the
EMX docs to see how to do it.

=item pdksh

To start external programs with complicated command lines (like with
pipes in between, and/or quoting of arguments), Perl uses an external
shell. With the EMX port such a shell should be named F, and located
either in the wired-in-during-compile locations (usually F), or in a
configurable location (see L<"PERL_SH_DIR">).

For best results use EMX pdksh.
The standard binary (5.2.14 or later) runs under DOS (with L) as well, see http://www.ilyaz.org/software/os2/ =back =head2 Starting Perl programs under OS/2 (and DOS and...) Start your Perl program F with arguments C the same way as on any other platform, by perl foo.pl arg1 arg2 arg3 If you want to specify perl options C<-my_opts> to the perl itself (as opposed to your program), use perl -my_opts foo.pl arg1 arg2 arg3 Alternately, if you use OS/2-ish shell, like CMD or 4os2, put the following at the start of your perl script: extproc perl -S -my_opts rename your program to F, and start it by typing foo arg1 arg2 arg3 Note that because of stupid OS/2 limitations the full path of the perl script is not available when you use C, thus you are forced to use C<-S> perl switch, and your script should be on the C. As a plus side, if you know a full path to your script, you may still start it with perl ../../blah/foo.cmd arg1 arg2 arg3 (note that the argument C<-my_opts> is taken care of by the C line in your script, see L on the first line>). To understand what the above I does, read perl docs about C<-S> switch - see L, and cmdref about C: view perl perlrun man perlrun view cmdref extproc help extproc or whatever method you prefer. There are also endless possibilities to use I of 4os2, I of WPS and so on... However, if you use *nixish shell (like F supplied in the binary distribution), you need to follow the syntax specified in L. Note that B<-S> switch supports scripts with additional extensions F<.cmd>, F<.btm>, F<.bat>, F<.pl> as well. =head2 Starting OS/2 (and DOS) programs under Perl This is what system() (see L), C<``> (see L), and I (see L) are for. (Avoid exec() (see L) unless you know what you do). Note however that to use some of these operators you need to have a sh-syntax shell installed (see L<"Pdksh">, L<"Frequently asked questions">), and perl should be able to find it (see L<"PERL_SH_DIR">). 
The cases when the shell is used are: =over =item 1 One-argument system() (see L), exec() (see L) with redirection or shell meta-characters; =item 2 Pipe-open (see L) with the command which contains redirection or shell meta-characters; =item 3 Backticks C<``> (see L) with the command which contains redirection or shell meta-characters; =item 4 If the executable called by system()/exec()/pipe-open()/C<``> is a script with the "magic" C<#!> line or C line which specifies shell; =item 5 If the executable called by system()/exec()/pipe-open()/C<``> is a script without "magic" line, and C<$ENV{EXECSHELL}> is set to shell; =item 6 If the executable called by system()/exec()/pipe-open()/C<``> is not found (is not this remark obsolete?); =item 7 For globbing (see L, L) (obsolete? Perl uses builtin globbing nowadays...). =back For the sake of speed for a common case, in the above algorithms backslashes in the command name are not considered as shell metacharacters. Perl starts scripts which begin with cookies C or C<#!> directly, without an intervention of shell. Perl uses the same algorithm to find the executable as F: if the path on C<#!> line does not work, and contains C, then the directory part of the executable is ignored, and the executable is searched in F<.> and on C. To find arguments for these scripts Perl uses a different algorithm than F: up to 3 arguments are recognized, and trailing whitespace is stripped. If a script does not contain such a cooky, then to avoid calling F, Perl uses the same algorithm as F: if C<$ENV{EXECSHELL}> is set, the script is given as the first argument to this command, if not set, then C<$ENV{COMSPEC} /c> is used (or a hardwired guess if C<$ENV{COMSPEC}> is not set). 
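
The first three cases in the list above come down to a test on the
command string. The following is a minimal sketch of such a test, not
the code the port actually uses: the helper name C<needs_shell> and the
exact set of metacharacters are assumptions made for illustration. Only
the exemption for backslashes is taken from the text above.

```perl
use strict;
use warnings;

# Hypothetical helper: guess whether a one-string command would be
# handed to the external shell.  The metacharacter set is an assumption.
sub needs_shell {
    my ($cmd) = @_;
    # Redirection, pipes, quoting, globbing and substitution characters.
    # Backslash is deliberately absent, so 'bin\prog.exe arg' stays direct.
    return $cmd =~ /[<>|&;"'`\$*?()\[\]{}~]/ ? 1 : 0;
}

for my $cmd ('prog.exe arg1 arg2',
             'bin\prog.exe arg1',
             'prog.exe < in > out',
             'a.exe | b.exe') {
    printf "%-22s => %s\n", $cmd, needs_shell($cmd) ? 'shell' : 'direct';
}
```

A command classified as direct here could be started without F<sh.exe>;
anything else would need the shell found via L<"PERL_SH_DIR">.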

When starting scripts directly, Perl uses exactly the same algorithm as
for the search of a script given by the B<-S> command-line option: it
will look in the current directory, then on components of C<$ENV{PATH}>
using the following order of appended extensions: no extension, F<.cmd>,
F<.btm>, F<.bat>, F<.pl>.

Note that Perl will start to look for scripts only if OS/2 cannot start
the specified application, thus C will not look for a script if there is
an executable file F I on C. In other words, C is essentially searched
twice: once by the OS for an executable, then by Perl for scripts.

Note also that executable files on OS/2 can have an arbitrary extension,
but F<.exe> will be automatically appended if no dot is present in the
name. The workaround is simple: since F and F denote the same file (at
least on FAT and HPFS file systems), to start an executable residing in
file F (no extension) give an argument C (dot appended) to system().

Perl will start PM programs from a VIO (=text-mode) Perl process in a
separate PM session; the opposite is not true: when you start a non-PM
program from a PM Perl process, Perl would not run it in a separate
session. If a separate session is desired, either ensure that the shell
will be used, as in C, or start it using optional arguments to system()
documented in C module. This is considered to be a feature.

=head1 Frequently asked questions

=head2 "It does not work"

Perl binary distributions come with a F script which tries to detect
common problems with misconfigured installations. There is a pretty
large chance it will discover which step of the installation you managed
to goof. C<;-)>

=head2 I cannot run external programs

=over 4

=item *

Did you run your programs with the C<-w> switch? See
L<"Starting OS/2 (and DOS) programs under Perl">.

=item *

Do you try to run I<internal> shell commands, like C<`copy a b`>
(internal for F), or C<`glob a*b`> (internal for ksh)?
You need to specify your shell explicitly, like C<`cmd /c copy a b`>,
since Perl cannot deduce which commands are internal to your shell.

=back

=head2 I cannot embed perl into my program, or use F<perl.dll> from my
program.

=over 4

=item Is your program EMX-compiled with C<-Zmt -Zcrtdll>?

Well, nowadays the Perl DLL should be usable from a differently compiled
program too... If you can run Perl code from REXX scripts (see L), then
there is some other aspect of the interaction which is overlooked by
the current hackish code to support differently-compiled principal
programs.

If everything else fails, you need to build a stand-alone DLL for
perl. Contact me, I did it once. Sockets would not work, nor would a
lot of other stuff.

=item Did you use L?

Some time ago I had reports it does not work. Nowadays it is checked
in the Perl test suite, so grep the F<./t> subdirectory of the build
tree (as well as F<*.t> files in the F<./lib> subdirectory) to find how
it should be done "correctly".

=back

=head2 C<``> and pipe-C<open> do not work under DOS.

This may be a variant of just L<"I cannot run external programs">, or a
deeper problem. Basically: you I<need> RSX (see L<"Prerequisites">)
for these commands to work, and you may need a port of F which
understands command arguments. One such port is listed in
L<"Prerequisites"> under RSX. Do not forget to set the variable
L<"PERL_SH_DIR"> as well.

DPMI is required for RSX.

=head2 Cannot start C<find.exe "pattern" file>

The whole idea of the "standard C API to start applications" is that
the forms C<foo> and C<"foo"> of program arguments are completely
interchangeable. F<find.exe> breaks this paradigm;

  find "pattern" file
  find pattern file

are not equivalent; F<find.exe> cannot be started directly using the
above API. One needs a way to surround the doublequotes in some other
quoting construction, necessarily having an extra non-Unixish shell in
between.

Use one of

  system 'cmd', '/c', 'find "pattern" file';
  `cmd /c 'find "pattern" file'`

This would start F<find.exe> via F<cmd.exe> via C<sh.exe> via
C<perl.exe>, but this is the price to pay if you want to use a
non-conforming program.
=head1 INSTALLATION =head2 Automatic binary installation The most convenient way of installing a binary distribution of perl is via perl installer F. Just follow the instructions, and 99% of the installation blues would go away. Note however, that you need to have F on your path, and EMX environment I. The latter means that if you just installed EMX, and made all the needed changes to F, you may need to reboot in between. Check EMX runtime by running emxrev Binary installer also creates a folder on your desktop with some useful objects. If you need to change some aspects of the work of the binary installer, feel free to edit the file F. This may be useful e.g., if you need to run the installer many times and do not want to make many interactive changes in the GUI. B =over 15 =item C may be needed if you change your codepage I perl installation, and the new value is not supported by EMX. See L<"PERL_BADLANG">. =item C see L<"PERL_BADFREE">. =item F This file resides somewhere deep in the location you installed your perl library, find it out by perl -MConfig -le "print $INC{'Config.pm'}" While most important values in this file I updated by the binary installer, some of them may need to be hand-edited. I know no such data, please keep me informed if you find one. Moreover, manual changes to the installed version may need to be accompanied by an edit of this file. =back B. Because of a typo the binary installer of 5.00305 would install a variable C into F. Please remove this variable and put C> instead. =head2 Manual binary installation As of version 5.00305, OS/2 perl binary distribution comes split into 11 components. Unfortunately, to enable configurable binary installation, the file paths in the zip files are not absolute, but relative to some directory. Note that the extraction with the stored paths is still necessary (default with unzip, specify C<-d> to pkunzip). However, you need to know where to extract the files. 
You need also to manually change entries in F to reflect where did you put the files. Note that if you have some primitive unzipper (like C), you may get a lot of warnings/errors during unzipping. Upgrade to C<(w)unzip>. Below is the sample of what to do to reproduce the configuration on my machine. In F you can press C now, and cut-and-paste from the resulting file - created in the directory you started F from. For each component, we mention environment variables related to each installation directory. Either choose directories to match your values of the variables, or create/append-to variables to take into account the directories. =over 3 =item Perl VIO and PM executables (dynamically linked) unzip perl_exc.zip *.exe *.ico -d f:/emx.add/bin unzip perl_exc.zip *.dll -d f:/emx.add/dll (have the directories with C<*.exe> on PATH, and C<*.dll> on LIBPATH); =item Perl_ VIO executable (statically linked) unzip perl_aou.zip -d f:/emx.add/bin (have the directory on PATH); =item Executables for Perl utilities unzip perl_utl.zip -d f:/emx.add/bin (have the directory on PATH); =item Main Perl library unzip perl_mlb.zip -d f:/perllib/lib If this directory is exactly the same as the prefix which was compiled into F, you do not need to change anything. However, for perl to find the library if you use a different path, you need to C in F, see L<"PERLLIB_PREFIX">. =item Additional Perl modules unzip perl_ste.zip -d f:/perllib/lib/site_perl/5.16.3/ Same remark as above applies. Additionally, if this directory is not one of directories on @INC (and @INC is influenced by C), you need to put this directory and subdirectory F<./os2> in C or C variable. Do not use C unless you have it set already. See L. B<[Check whether this extraction directory is still applicable with the new directory structure layout!]> =item Tools to compile Perl modules unzip perl_blb.zip -d f:/perllib/lib Same remark as for F. 
=item Manpages for Perl and utilities unzip perl_man.zip -d f:/perllib/man This directory should better be on C. You need to have a working F to access these files. =item Manpages for Perl modules unzip perl_mam.zip -d f:/perllib/man This directory should better be on C. You need to have a working man to access these files. =item Source for Perl documentation unzip perl_pod.zip -d f:/perllib/lib This is used by the C program (see L), and may be used to generate HTML documentation usable by WWW browsers, and documentation in zillions of other formats: C, C, C, C and so on. [Use programs such as F etc.] =item Perl manual in F<.INF> format unzip perl_inf.zip -d d:/os2/book This directory should better be on C. =item Pdksh unzip perl_sh.zip -d f:/bin This is used by perl to run external commands which explicitly require shell, like the commands using I and I. It is also used instead of explicit F. Set C (see L<"PERL_SH_DIR">) if you move F from the above location. B It may be possible to use some other sh-compatible shell (untested). =back After you installed the components you needed and updated the F correspondingly, you need to hand-edit F. This file resides somewhere deep in the location you installed your perl library, find it out by perl -MConfig -le "print $INC{'Config.pm'}" You need to correct all the entries which look like file paths (they currently start with C). =head2 B The automatic and manual perl installation leave precompiled paths inside perl executables. While these paths are overwriteable (see L<"PERLLIB_PREFIX">, L<"PERL_SH_DIR">), some people may prefer binary editing of paths inside the executables/DLLs. =head1 Accessing documentation Depending on how you built/installed perl you may have (otherwise identical) Perl documentation in the following formats: =head2 OS/2 F<.INF> file Most probably the most convenient form. 
Under OS/2 view it as view perl view perl perlfunc view perl less view perl ExtUtils::MakeMaker (currently the last two may hit a wrong location, but this may improve soon). Under Win* see L<"SYNOPSIS">. If you want to build the docs yourself, and have I, run pod2ipf > perl.ipf in F directory, then ipfc /inf perl.ipf (Expect a lot of errors during the both steps.) Now move it on your BOOKSHELF path. =head2 Plain text If you have perl documentation in the source form, perl utilities installed, and GNU groff installed, you may use perldoc perlfunc perldoc less perldoc ExtUtils::MakeMaker to access the perl documentation in the text form (note that you may get better results using perl manpages). Alternately, try running pod2text on F<.pod> files. =head2 Manpages If you have F installed on your system, and you installed perl manpages, use something like this: man perlfunc man 3 less man ExtUtils.MakeMaker to access documentation for different components of Perl. Start with man perl Note that dot (F<.>) is used as a package separator for documentation for packages, and as usual, sometimes you need to give the section - C<3> above - to avoid shadowing by the I. Make sure that the directory B the directory with manpages is on our C, like this set MANPATH=c:/man;f:/perllib/man for Perl manpages in C etc. =head2 HTML If you have some WWW browser available, installed the Perl documentation in the source form, and Perl utilities, you can build HTML docs. Cd to directory with F<.pod> files, and do like this cd f:/perllib/lib/pod pod2html After this you can direct your browser the file F in this directory, and go ahead with reading docs, like this: explore file:///f:/perllib/lib/pod/perl.html Alternatively you may be able to get these docs prebuilt from CPAN. =head2 GNU C files Users of Emacs would appreciate it very much, especially with C mode loaded. You need to get latest C from C, or, alternately, the prebuilt info pages. 
=head2 F files for C are available on CPAN (may be for slightly older version of perl). =head2 C docs can be constructed using C. =head1 BUILD Here we discuss how to build Perl under OS/2. =head2 The short story Assume that you are a seasoned porter, so are sure that all the necessary tools are already present on your system, and you know how to get the Perl source distribution. Untar it, change to the extract directory, and gnupatch -p0 < os2\diff.configure sh Configure -des -D prefix=f:/perllib make make test make install make aout_test make aout_install This puts the executables in f:/perllib/bin. Manually move them to the C, manually move the built F to C (here for Perl DLL F<*> is a not-very-meaningful hex checksum), and run make installcmd INSTALLCMDDIR=d:/ir/on/path Assuming that the C-files were put on an appropriate location, this completes the installation of minimal Perl system. (The binary distribution contains also a lot of additional modules, and the documentation in INF format.) What follows is a detailed guide through these steps. =head2 Prerequisites You need to have the latest EMX development environment, the full GNU tool suite (gawk renamed to awk, and GNU F earlier on path than the OS/2 F, same with F, to check use find --version sort --version ). You need the latest version of F installed as F. Check that you have B libraries and headers installed, and - optionally - Berkeley DB headers and libraries, and crypt. Possible locations to get the files: ftp://ftp.uni-heidelberg.de/pub/os2/unix/ http://hobbes.nmsu.edu/h-browse.php?dir=/pub/os2 http://cd.textfiles.com/hobbesos29804/disk1/DEV32/ http://cd.textfiles.com/hobbesos29804/disk1/EMX09C/ It is reported that the following archives contain enough utils to build perl: F, F, F, F, F, F, F, F, F and F (or a later version). 

Note that all these utilities are known to be available from LEO:

  ftp://crydee.sai.msu.ru/pub/comp/os/os2/leo/gnu/

Note also that the F and F from the EMX distribution are not suitable
for multi-threaded compile (even single-threaded flavor of Perl uses
multi-threaded C RTL, for compatibility with XFree86-OS/2). Get a
corrected one from

  http://www.ilyaz.org/software/os2/db_mt.zip

If you have I installed already, make sure that no copies of perl are
currently running. Later steps of the build may fail since an older
version of F loaded into memory may be found. Running C becomes
meaningless, since the tests are checking a previous build of perl (this
situation is detected and reported by F test). Do not forget to unset C
in environment.

Also make sure that you have F directory on the current drive, and F<.>
directory in your C. One may try to correct the latter condition by

  set BEGINLIBPATH .\.

if you use something like F or latest versions of F<4os2.exe>. (Setting
BEGINLIBPATH to just C<.> is ignored by the OS/2 kernel.)

Make sure your gcc is good for C<-Zomf> linking: run C script in F
directory.

Check that you have link386 installed. It comes standard with OS/2, but
may not be installed due to customization. If typing

  link386

shows you do not have it, do I, and choose C in I. If you get into
link386 prompts, press C to exit.

=head2 Getting perl source

You need to fetch the latest perl source (including developers
releases). With some probability it is located in

  http://www.cpan.org/src/
  http://www.cpan.org/src/unsupported

If not, you may need to dig in the indices to find it in the directory
of the current maintainer.

The quick cycle of developer releases may break the OS/2 build from time
to time; looking into

  http://www.cpan.org/ports/os2/

may indicate the latest release which was publicly released by the
maintainer. Note that the release may include some additional patches to
apply to the current source of perl.


Extract it like this

  tar vzxf perl5.00409.tar.gz

You may see a message about errors while extracting F. This is because
there is a conflict with a similarly-named file F.

Change to the directory of extraction.

=head2 Application of the patches

You need to apply the patches in F<./os2/diff.*> like this:

  gnupatch -p0 < os2\diff.configure

You may also need to apply the patches supplied with the binary
distribution of perl. It also makes sense to look on the perl5-porters
mailing list for the latest OS/2-related patches (see L). Such patches
usually contain strings C and C, so it makes sense to look for these
strings.

=head2 Hand-editing

You may look into the file F<./hints/os2.sh> and correct anything wrong
you find there. I do not expect it is needed anywhere.

=head2 Making

  sh Configure -des -D prefix=f:/perllib

C means: where to install the resulting perl library. Giving correct
prefix you may avoid the need to specify C, see L<"PERLLIB_PREFIX">.

I, and about C<-c> option to tr>. The latter is most probably already
fixed, if you see it and can trace where the latter spurious warning
comes from, please inform me.

Now

  make

At some moment the build may die, reporting a I or I>. This means that
you do not have F<.> in your LIBPATH, so F cannot find the needed F
(treat these hex digits as line noise). After this is fixed the build
should finish without a lot of fuss.

=head2 Testing

Now run

  make test

All tests should succeed (with some of them skipped). If you have the
same version of Perl installed, it is crucial that you have C<.> early
in your LIBPATH (or in BEGINLIBPATH), otherwise your tests will most
probably test the wrong version of Perl.

Some tests may generate extra messages similar to

=over 4

=item A lot of C

in database tests related to Berkeley DB. I If it persists, you may
disable these warnings, see L<"PERL_BADFREE">.

=item Process terminated by SIGTERM/SIGINT

This is a standard message issued by OS/2 applications. *nix
applications die in silence.
It is considered to be a feature. One can easily disable this by
appropriate sighandlers.

However the test engine bleeds these messages to the screen at
unexpected moments. Two messages of this kind I be present during
testing.

=back

To get finer test reports, call

  perl t/harness

The report with F failing may look like this:

  Failed Test  Status Wstat Total Fail  Failed  List of failed
  ------------------------------------------------------------
  io/pipe.t                    12    1   8.33%  9
  7 tests skipped, plus 56 subtests skipped.
  Failed 1/195 test scripts, 99.49% okay. 1/6542 subtests failed, 99.98% okay.

The reasons for most important skipped tests are:

=over 8

=item F

=over 4

=item 18

Checks C and C of C - unfortunately, HPFS provides only 2sec time
granularity (for compatibility with FAT?).

=item 25

Checks C on a filehandle just opened for write - I do not know why this
should or should not work.

=back

=item F

Checks C. Tests:

=over 4

=item 4

Checks C and C of C - unfortunately, HPFS provides only 2sec time
granularity (for compatibility with FAT?).

=back

=back

=head2 Installing the built perl

If you haven't yet moved C onto LIBPATH, do it now.

Run

  make install

It would put the generated files into needed locations. Manually put F,
F and F to a location on your PATH, F to a location on your LIBPATH.

Run

  make installcmd INSTALLCMDDIR=d:/ir/on/path

to convert perl utilities to F<.cmd> files and put them on PATH. You
need to put F<.EXE>-utilities on path manually. They are installed in
C<$prefix/bin>, here C<$prefix> is what you gave to F, see L.

If you use C, either move the installed F<*/man/> directories to your C,
or modify C to match the location. (One could have avoided this by
providing a correct C option to F<./Configure>, or editing
F<./config.sh> between configuring and making steps.)

=head2 C-style build

Proceed as above, but make F (see L<"perl_.exe">) by

  make perl_

test and install by

  make aout_test
  make aout_install

Manually put F to a location on your PATH.

B The build process for C I about all the dependencies, so you should make sure that anything is up-to-date, say, by doing make perl_dll first. =head1 Building a binary distribution [This section provides a short overview only...] Building should proceed differently depending on whether the version of perl you install is already present and used on your system, or is a new version not yet used. The description below assumes that the version is new, so installing its DLLs and F<.pm> files will not disrupt the operation of your system even if some intermediate steps are not yet fully working. The other cases require a little bit more convoluted procedures. Below I suppose that the current version of Perl is C<5.8.2>, so the executables are named accordingly. =over =item 1. Fully build and test the Perl distribution. Make sure that no tests are failing with C and C targets; fix the bugs in Perl and the Perl test suite detected by these tests. Make sure that C make target runs as clean as possible. Check that F runs fine. =item 2. Fully install Perl, including C target. Copy the generated DLLs to C; copy the numbered Perl executables (as in F) to C; copy C to C as C. Think whether you need backward-compatibility DLLs. In most cases you do not need to install them yet; but sometime this may simplify the following steps. =item 3. Make sure that C can download files from CPAN. If not, you may need to manually install C. =item 4. Install the bundle C perl5.8.2 -MCPAN -e "install Bundle::OS2_default" < nul |& tee 00cpan_i_1 This may take a couple of hours on 1GHz processor (when run the first time). And this should not be necessarily a smooth procedure. Some modules may not specify required dependencies, so one may need to repeat this procedure several times until the results stabilize. 

  perl5.8.2 -MCPAN -e "install Bundle::OS2_default" < nul |& tee 00cpan_i_2
  perl5.8.2 -MCPAN -e "install Bundle::OS2_default" < nul |& tee 00cpan_i_3

Even after they stabilize, some tests may fail.

Fix as many discovered bugs as possible. Document all the bugs which are
not fixed, and all the failures with unknown reasons. Inspect the
produced logs F<00cpan_i_1> to find suspiciously skipped tests, and
other fishy events.

Keep in mind that I of some modules may fail too: for example, the DLLs
to update may be already loaded by F. Inspect the C logs (in the example
above F<00cpan_i_1> etc) for errors, and install things manually, as in

  cd $CPANHOME/.cpan/build/Digest-MD5-2.31
  make install

Some distributions may fail some tests, but you may want to install them
anyway (as above, or via C command of C shell-mode).

Since this procedure may take quite a long time to complete, it makes
sense to "freeze" your CPAN configuration by disabling periodic updates
of the local copy of CPAN index: set C to some big value (I use 365),
then save the settings

  CPAN> o conf index_expire 365
  CPAN> o conf commit

Reset back to the default value C<1> when you are finished.

=item 5.

When satisfied with the results, rerun the C target. Now you can copy C
to C, and install the other OMF-build executables: C etc. They are ready
to be used.

=item 6.

Change to the C<./pod> directory of the build tree, download the Perl
logo F, and run

  ( perl2ipf > perl.ipf ) |& tee 00ipf
  ipfc /INF perl.ipf |& tee 00inf

This produces the Perl docs online book C. Install it on the C path.

=item 7.

Now is the time to build statically linked executable F which includes
the modules newly installed via C. Doing testing via C is going to be
painfully slow, since it statically links a new executable per XS
extension.
Here is a possible workaround: create a toplevel F in F<$CPANHOME/.cpan/build/> with contents being (compare with L) use ExtUtils::MakeMaker; WriteMakefile NAME => 'dummy'; execute this as perl_5.8.2.exe Makefile.PL 's in subdirectories may be buggy, and would not run as "child" scripts. The interdependency of modules can strike you; however, since non-XS modules are already installed, the prerequisites of most modules have a very good chance to be present. If you discover some glitches, move directories of problematic modules to a different location; if these modules are non-XS modules, you may just ignore them - they are already installed; the remaining, XS, modules you need to install manually one by one. After each such removal you need to rerun the C/C process; usually this procedure converges soon. (But be sure to convert all the necessary external C libraries from F<.lib> format to F<.a> format: run one of emxaout foo.lib emximp -o foo.a foo.lib whichever is appropriate.) Also, make sure that the DLLs for external libraries are usable with with executables compiled without C<-Zmtd> options. When you are sure that only a few subdirectories lead to failures, you may want to add C<-j4> option to C to speed up skipping subdirectories with already finished build. When you are satisfied with the results of tests, install the build C libraries for extensions: make install |& tee 00aout_i Now you can rename the file F<./perl.exe> generated during the last phase to F; place it on C; if there is an inter-dependency between some XS modules, you may need to repeat the C/C loop with this new executable and some excluded modules - until the procedure converges. Now you have all the necessary F<.a> libraries for these Perl modules in the places where Perl builder can find it. 
Use the perl builder: change to an empty directory, create a "dummy" F again, and run perl_5.8.2.exe Makefile.PL |& tee 00c make perl |& tee 00p This should create an executable F<./perl.exe> with all the statically loaded extensions built in. Compare the generated F files to make sure that during the iterations the number of loaded extensions only increases. Rename F<./perl.exe> to F on C. When it converges, you got a functional variant of F; copy it to C. You are done with generation of the local Perl installation. =item 8. Make sure that the installed modules are actually installed in the location of the new Perl, and are not inherited from entries of @INC given for inheritance from the older versions of Perl: set C to redirect the new version of Perl to a new location, and copy the installed files to this new location. Redo the tests to make sure that the versions of modules inherited from older versions of Perl are not needed. Actually, the log output of L during the step 6 gives a very detailed info about which modules are loaded from which place; so you may use it as an additional verification tool. Check that some temporary files did not make into the perl install tree. Run something like this pfind . -f "!(/\.(pm|pl|ix|al|h|a|lib|txt|pod|imp|bs|dll|ld|bs|inc|xbm|yml|cgi|uu|e2x|skip|packlist|eg|cfg|html|pub|enc|all|ini|po|pot)$/i or /^\w+$/") | less in the install tree (both top one and F one). Compress all the DLLs with F. The tiny F<.exe> can be compressed with C (the bug only appears when there is a fixup in the last 6 bytes of a page (?); since the tiny executables are much smaller than a page, the bug will not hit). Do not compress C - it would not work under DOS. =item 9. Now you can generate the binary distribution. This is done by running the test of the CPAN distribution C. Tune up the file F to suit the layout of current version of Perl first. Do not forget to pack the necessary external DLLs accordingly. 
Include the description of the bugs and test suite failures you could not fix. Include the small-stack versions of Perl executables from Perl build directory. Include F so that people can relink the perl DLL preserving the binary compatibility, or can create compatibility DLLs. Include the diff files (C) of fixes you did so that people can rebuild your version. Include F so that one can use remote debugging. =item 10. Share what you did with the other people. Relax. Enjoy fruits of your work. =item 11. Brace yourself for thanks, bug reports, hate mail and spam coming as result of the previous step. No good deed should remain unpunished! =back =head1 Building custom F<.EXE> files The Perl executables can be easily rebuilt at any moment. Moreover, one can use the I interface (see L) to make very customized executables. =head2 Making executables with a custom collection of statically loaded extensions It is a little bit easier to do so while I the list of statically loaded extensions. We discuss this case only here. =over =item 1. Change to an empty directory, and create a placeholder : use ExtUtils::MakeMaker; WriteMakefile NAME => 'dummy'; =item 2. Run it with the flavor of Perl (F or F) you want to rebuild. perl_ Makefile.PL =item 3. Ask it to create new Perl executable: make perl (you may need to manually add C to this commandline on some versions of Perl; the symptom is that the command-line globbing does not work from OS/2 shells with the newly-compiled executable; check with .\perl.exe -wle "print for @ARGV" * ). =item 4. The previous step created F which contains a list of newXS() calls near the end. Removing unnecessary calls, and rerunning make perl will produce a customized executable. =back =head2 Making executables with a custom search-paths The default perl executable is flexible enough to support most usages. 
However, one may want something yet more flexible; for example, one may
want to find Perl DLL relative to the location of the EXE file; or one
may want to ignore the environment when setting the Perl-library search
path, etc.

If you feel comfortable with I interface (see L), such things are easy
to do repeating the steps outlined in L, and doing more comprehensive
edits to main() of F. The people with little desire to understand Perl
can just rename main(), and do necessary modification in a custom main()
which calls the renamed function at the appropriate time.

However, there is a third way: perl DLL exports the main() function and
several callbacks to customize the search path. Below is a complete
example of a "Perl loader" which

=over

=item 1.

Looks for Perl DLL in the directory C<$exedir/../dll>;

=item 2.

Prepends the above directory to C;

=item 3.

Fails if the Perl DLL found via C is different from what was loaded on
step 1; e.g., another process could have loaded it from C or from a
different value of C. In these cases one needs to modify the setting of
the system so that this other process either does not run, or loads the
DLL from C with C (available with kernels after September 2000).

=item 4.

Loads Perl library from C<$exedir/../dll/lib/>.

=item 5.

Uses Bourne shell from C<$exedir/../dll/sh/ksh.exe>.

=back

For best results compile the C file below with the same options as the
Perl DLL.
However, a lot of functionality will work even if the executable is not an EMX applications, e.g., if compiled with gcc -Wall -DDOSISH -DOS2=1 -O2 -s -Zomf -Zsys perl-starter.c -DPERL_DLL_BASENAME=\"perl312F\" -Zstack 8192 -Zlinker /PM:VIO Here is the sample C file: #define INCL_DOS #define INCL_NOPM /* These are needed for compile if os2.h includes os2tk.h, not os2emx.h */ #define INCL_DOSPROCESS #include #include "EXTERN.h" #define PERL_IN_MINIPERLMAIN_C #include "perl.h" static char *me; HMODULE handle; static void die_with(char *msg1, char *msg2, char *msg3, char *msg4) { ULONG c; char *s = " error: "; DosWrite(2, me, strlen(me), &c); DosWrite(2, s, strlen(s), &c); DosWrite(2, msg1, strlen(msg1), &c); DosWrite(2, msg2, strlen(msg2), &c); DosWrite(2, msg3, strlen(msg3), &c); DosWrite(2, msg4, strlen(msg4), &c); DosWrite(2, "\r\n", 2, &c); exit(255); } typedef ULONG (*fill_extLibpath_t)(int type, char *pre, char *post, int replace, char *msg); typedef int (*main_t)(int type, char *argv[], char *env[]); typedef int (*handler_t)(void* data, int which); #ifndef PERL_DLL_BASENAME # define PERL_DLL_BASENAME "perl" #endif static HMODULE load_perl_dll(char *basename) { char buf[300], fail[260]; STRLEN l, dirl; fill_extLibpath_t f; ULONG rc_fullname; HMODULE handle, handle1; if (_execname(buf, sizeof(buf) - 13) != 0) die_with("Can't find full path: ", strerror(errno), "", ""); /* XXXX Fill 'me' with new value */ l = strlen(buf); while (l && buf[l-1] != '/' && buf[l-1] != '\\') l--; dirl = l - 1; strcpy(buf + l, basename); l += strlen(basename); strcpy(buf + l, ".dll"); if ( (rc_fullname = DosLoadModule(fail, sizeof fail, buf, &handle)) != 0 && DosLoadModule(fail, sizeof fail, basename, &handle) != 0 ) die_with("Can't load DLL ", buf, "", ""); if (rc_fullname) return handle; /* was loaded with short name; all is fine */ if (DosQueryProcAddr(handle, 0, "fill_extLibpath", (PFN*)&f)) die_with(buf, ": DLL exports no symbol ", "fill_extLibpath", ""); buf[dirl] = 0; if (f(0 
/*BEGINLIBPATH*/, buf /* prepend */, NULL /* append */, 0 /* keep old value */, me)) die_with(me, ": prepending BEGINLIBPATH", "", ""); if (DosLoadModule(fail, sizeof fail, basename, &handle1) != 0) die_with(me, ": finding perl DLL again via BEGINLIBPATH", "", ""); buf[dirl] = '\\'; if (handle1 != handle) { if (DosQueryModuleName(handle1, sizeof(fail), fail)) strcpy(fail, "???"); die_with(buf, ":\n\tperl DLL via BEGINLIBPATH is different: \n\t", fail, "\n\tYou may need to manipulate global BEGINLIBPATH and LIBPATHSTRICT" "\n\tso that the other copy is loaded via BEGINLIBPATH."); } return handle; } int main(int argc, char **argv, char **env) { main_t f; handler_t h; me = argv[0]; /**/ handle = load_perl_dll(PERL_DLL_BASENAME); if (DosQueryProcAddr(handle, 0, "Perl_OS2_handler_install", (PFN*)&h)) die_with(PERL_DLL_BASENAME, ": DLL exports no symbol ", "Perl_OS2_handler_install", ""); if ( !h((void *)"~installprefix", Perlos2_handler_perllib_from) || !h((void *)"~dll", Perlos2_handler_perllib_to) || !h((void *)"~dll/sh/ksh.exe", Perlos2_handler_perl_sh) ) die_with(PERL_DLL_BASENAME, ": Can't install @INC manglers", "", ""); if (DosQueryProcAddr(handle, 0, "dll_perlmain", (PFN*)&f)) die_with(PERL_DLL_BASENAME, ": DLL exports no symbol ", "dll_perlmain", ""); return f(argc, argv, env); } =head1 Build FAQ =head2 Some C became C<\> in pdksh. You have a very old pdksh. See L. =head2 C<'errno'> - unresolved external You do not have MT-safe F. See L. =head2 Problems with tr or sed reported with very old version of tr and sed. =head2 Some problem (forget which ;-) You have an older version of F on your LIBPATH, which broke the build of extensions. =head2 Library ... not found You did not run C. See L. =head2 Segfault in make You use an old version of GNU make. See L. =head2 op/sprintf test failure This can result from a bug in emx sprintf which was fixed in 0.9d fix 03. 
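
=head2 How does the sample loader build the DLL path?

The C<load_perl_dll()> function in the "Perl loader" sample earlier in
this document strips the file name from the executable's full path and
appends the DLL base name plus C<.dll>. The fragment below is a minimal,
portable sketch of just that string manipulation, for readers who want
to experiment with it away from OS/2. The helper name
C<sibling_dll_path()> and its explicit size check are assumptions made
for this illustration; they are not part of the sample or of Perl.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of the loader sample): write into 'out'
 * the path of "<directory of exe_path>/<basename>.dll".
 * Returns 0 on success, -1 if 'out' is too small. */
static int sibling_dll_path(const char *exe_path, const char *basename,
                            char *out, size_t out_size)
{
    size_t l = strlen(exe_path);
    size_t need;

    /* Back up to just after the last '/' or '\\', as the sample does. */
    while (l && exe_path[l - 1] != '/' && exe_path[l - 1] != '\\')
        l--;

    /* sizeof(".dll") is 5: four characters plus the terminating NUL. */
    need = l + strlen(basename) + sizeof(".dll");
    if (need > out_size)
        return -1;

    memcpy(out, exe_path, l);      /* directory part, separator included */
    strcpy(out + l, basename);     /* DLL base name */
    strcat(out, ".dll");           /* extension */
    return 0;
}
```

The sample itself reserves room by passing C<sizeof(buf) - 13> to
C<_execname()>; the sketch checks the required length explicitly
instead, which is only one possible design choice.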
=head1 Specific (mis)features of OS/2 port =head2 C, C Note that these functions are compatible with *nix, not with the older ports of '94 - 95. The priorities are absolute, go from 32 to -95, lower is quicker. 0 is the default priority. B. Calling C on a non-existing process could lock the system before Warp3 fixpak22. Starting with Warp3, Perl will use a workaround: it aborts getpriority() if the process is not present. This is not possible on older versions C<2.*>, and has a race condition anyway. =head2 C Multi-argument form of C allows an additional numeric argument. The meaning of this argument is described in L. When finding a program to run, Perl first asks the OS to look for executables on C (OS/2 adds extension F<.exe> if no extension is present). If not found, it looks for a script with possible extensions added in this order: no extension, F<.cmd>, F<.btm>, F<.bat>, F<.pl>. If found, Perl checks the start of the file for magic strings C<"#!"> and C<"extproc ">. If found, Perl uses the rest of the first line as the beginning of the command line to run this script. The only mangling done to the first line is extraction of arguments (currently up to 3), and ignoring of the path-part of the "interpreter" name if it can't be found using the full path. E.g., C may lead Perl to finding F with the first line being extproc /bin/bash -x -c If F is not found, then Perl looks for an executable F on C. If found in F, then the above system() is translated to system qw(C:/emx.add/bin/bash.exe -x -c C:/emx/bin/foo.cmd bar baz) One additional translation is performed: instead of F Perl uses the hardwired-or-customized shell (see C>). The above search for "interpreter" is recursive: if F executable is not found, but F is found, Perl will investigate its first line etc. The only hardwired limit on the recursion depth is implicit: there is a limit 4 on the number of additional arguments inserted before the actual arguments given to system(). 
In particular, if no additional arguments are specified on the "magic" first lines, then the limit on the depth is 4. If Perl finds that the found executable is of PM type when the current session is not, it will start the new process in a separate session of necessary type. Call via C to disable this magic. B. Due to the described logic, you need to explicitly specify F<.com> extension if needed. Moreover, if the executable F is requested, Perl will not look for F. [This may change in the future.] =head2 C on the first line If the first chars of a Perl script are C<"extproc ">, this line is treated as C<#!>-line, thus all the switches on this line are processed (twice if script was started via cmd.exe). See L. =head2 Additional modules: L, L, L, L, L. These modules provide access to additional numeric argument for C and to the information about the running process, to DLLs having functions with REXX signature and to the REXX runtime, to OS/2 databases in the F<.INI> format, and to Extended Attributes. Two additional extensions by Andreas Kaiser, C, and C, are included into C directory, mirrored on CPAN. Other OS/2-related extensions are available too. =head2 Prebuilt methods: =over 4 =item C used by C, see L. =item C used by C for DLL name mangling. =item C Self explanatory. =item C leaves drive as it is. =item C changes the "current" drive. =item C means has drive letter and is_rooted. =item C means has leading C<[/\\]> (maybe after a drive-letter:). =item C means changes with current dir. =item C Interface to cwd from EMX. Used by C. =item C Really really odious function to implement. Returns absolute name of file which would have C if CWD were C. C defaults to the current dir. =item C Get current value of extended library search path. If C is present and positive, works with C, if negative, works with C, otherwise with C. =item C Set current value of extended library search path. 
If C is present and positive, works with , if negative, works with C, otherwise with C. =item C Returns C if it was not called yet, otherwise bit 1 is set if on the previous call do_harderror was enabled, bit 2 is set if on previous call do_exception was enabled. This function enables/disables error popups associated with hardware errors (Disk not ready etc.) and software exceptions. I know of no way to find out the state of popups I the first call to this function. =item C Returns C if it was not called yet, otherwise return false if errors were not requested to be written to a hard drive, or the drive letter if this was requested. This function may redirect error popups associated with hardware errors (Disk not ready etc.) and software exceptions to the file POPUPLOG.OS2 at the root directory of the specified drive. Overrides OS2::Error() specified by individual programs. Given argument undef will disable redirection. Has global effect, persists after the application exits. I know of no way to find out the state of redirection of popups to the disk I the first call to this function. =item OS2::SysInfo() Returns a hash with system information. The keys of the hash are MAX_PATH_LENGTH, MAX_TEXT_SESSIONS, MAX_PM_SESSIONS, MAX_VDM_SESSIONS, BOOT_DRIVE, DYN_PRI_VARIATION, MAX_WAIT, MIN_SLICE, MAX_SLICE, PAGE_SIZE, VERSION_MAJOR, VERSION_MINOR, VERSION_REVISION, MS_COUNT, TIME_LOW, TIME_HIGH, TOTPHYSMEM, TOTRESMEM, TOTAVAILMEM, MAXPRMEM, MAXSHMEM, TIMER_INTERVAL, MAX_COMP_LENGTH, FOREGROUND_FS_SESSION, FOREGROUND_PROCESS =item OS2::BootDrive() Returns a letter without colon. =item C, C Transforms the current application into a PM application and back. The argument true means that a real message loop is going to be served. OS2::MorphPM() returns the PM message queue handle as an integer. See L<"Centralized management of resources"> for additional details. =item C Fake on-demand retrieval of outstanding PM messages. 
If C is false, will not dispatch messages if a real message loop is known to be present. Returns number of messages retrieved. Dies with "QUITing..." if WM_QUIT message is obtained. =item C Retrieval of PM messages until window creation/destruction. If C is false, will not dispatch messages if a real message loop is known to be present. Returns change in number of windows. If C is given, it is incremented by the number of messages retrieved. Dies with "QUITing..." if WM_QUIT message is obtained. =item C the same as L<_control87(3)> of EMX. Takes integers as arguments, returns the previous coprocessor control word as an integer. Only bits in C which are present in C are changed in the control word. =item OS2::get_control87() gets the coprocessor control word as an integer. =item C The variant of OS2::_control87() with default values good for handling exception mask: if no C, uses exception mask part of C only. If no C, disables all the floating point exceptions. See L<"Misfeatures"> for details. =item C Gives the information about the Perl DLL or the DLL containing the C function bound to by C<&xsub>. The meaning of C is: default (2): full name; 0: handle; 1: module name. =back (Note that some of these may be moved to different libraries - eventually). =head2 Prebuilt variables: =over 4 =item $OS2::emx_rev numeric value is the same as _emx_rev of EMX, a string value the same as _emx_vprt (similar to C<0.9c>). =item $OS2::emx_env same as _emx_env of EMX, a number similar to 0x8001. =item $OS2::os_ver a number C. =item $OS2::is_aout true if the Perl library was compiled in AOUT format. =item $OS2::can_fork true if the current executable is an AOUT EMX executable, so Perl can fork. Do not use this, use the portable check for $Config::Config{dfork}. =item $OS2::nsyserror This variable (default is 1) controls whether to enforce the contents of $^E to start with C-like id. If set to 0, then the string value of $^E is what is available from the OS/2 message file. 
(Some messages in this file have an C-like id prepended, some not.) =back =head2 Misfeatures =over 4 =item * Since L is present in EMX, but is not functional, it is emulated by perl. To disable the emulations, set environment variable C. =item * Here is the list of things which may be "broken" on EMX (from EMX docs): =over 4 =item * The functions L, L, and L are not implemented. =item * L is not required and not implemented. =item * L is not yet implemented (dummy function). (Perl has a workaround.) =item * L: Special treatment of PID=0, PID=1 and PID=-1 is not implemented. =item * L: WUNTRACED Not implemented. waitpid() is not implemented for negative values of PID. =back Note that C does not work with the current version of EMX. =item * See L<"Text-mode filehandles">. =item * Unix-domain sockets on OS/2 live in a pseudo-file-system C. To avoid a failure to create a socket with a name of a different form, C<"/socket/"> is prepended to the socket name (unless it starts with this already). This may lead to problems later in case the socket is accessed via the "usual" file-system calls using the "initial" name. =item * Apparently, IBM used a compiler (for some period of time around '95?) which changes FP mask right and left. This is not I bad for IBM's programs, but the same compiler was used for DLLs which are used with general-purpose applications. When these DLLs are used, the state of floating-point flags in the application is not predictable. What is much worse, some DLLs change the floating point flags when in _DLLInitTerm() (e.g., F). This means that even if you do not I any function in the DLL, just the act of loading this DLL will reset your flags. What is worse, the same compiler was used to compile some HOOK DLLs. Given that HOOK dlls are executed in the context of I the applications in the system, this means a complete unpredictability of floating point flags on systems using such HOOK DLLs. 
E.g., F of B origin changes the floating point flags on each write to the TTY of a VIO (windowed text-mode) applications. Some other (not completely debugged) situations when FP flags change include some video drivers (?), and some operations related to creation of the windows. People who code B may have more experience on this. Perl is generally used in the situation when all the floating-point exceptions are ignored, as is the default under EMX. If they are not ignored, some benign Perl programs would get a C and would die a horrible death. To circumvent this, Perl uses two hacks. They help against I type of damage only: FP flags changed when loading a DLL. One of the hacks is to disable floating point exceptions on Perl startup (as is the default with EMX). This helps only with compile-time-linked DLLs changing the flags before main() had a chance to be called. The other hack is to restore FP flags after a call to dlopen(). This helps against similar damage done by DLLs _DLLInitTerm() at runtime. Currently no way to switch these hacks off is provided. =back =head2 Modifications Perl modifies some standard C library calls in the following ways: =over 9 =item C C uses F if shell is required, cf. L<"PERL_SH_DIR">. =item C is created using C or C environment variable, via C. =item C If the current directory is not writable, file is created using modified C, so there may be a race condition. =item C a dummy implementation. =item C C special-cases F and F. =item C, C these EMX functions do not work if the path contains a trailing C. Perl contains a workaround for this. =item C Since L is present in EMX, but is not functional, it is emulated by perl. To disable the emulations, set environment variable C. =back =head2 Identifying DLLs All the DLLs built with the current versions of Perl have ID strings identifying the name of the extension, its version, and the version of Perl required for this DLL. Run C to find this info. 
=head2 Centralized management of resources Since to call certain OS/2 API one needs to have a correctly initialized C subsystem, OS/2-specific extensions may require getting Cs and Cs. If an extension would do it on its own, another extension could fail to initialize. Perl provides a centralized management of these resources: =over =item C To get the HAB, the extension should call C in C. After this call is performed, C may be accessed as C. There is no need to release the HAB after it is used. If by some reasons F cannot be included, use extern int Perl_hab_GET(void); instead. =item C There are two cases: =over =item * the extension needs an C only because some API will not work otherwise. Use C below. =item * the extension needs an C since it wants to engage in a PM event loop. Use C below. =back To get an C, the extension should call C in C. After this call is performed, C may be accessed as C. To signal to Perl that HMQ is not needed any more, call C. Perl process will automatically morph/unmorph itself into/from a PM process if HMQ is needed/not-needed. Perl will automatically enable/disable C message during shutdown if the message queue is served/not-served. B. If during a shutdown there is a message queue which did not disable WM_QUIT, and which did not process the received WM_QUIT message, the shutdown will be automatically cancelled. Do not call C unless you are going to process messages on an orderly basis. =item Treating errors reported by OS/2 API There are two principal conventions (it is useful to call them C and C - though this part of the function signature is not always determined by the name of the API) of reporting the error conditions of OS/2 API. Most of C APIs report the error code as the result of the call (so 0 means success, and there are many types of errors). Most of C API report success/fail via the result being C/C; to find the reason for the failure one should call WinGetLastError() API. 
Some C entry points also overload a "meaningful" return value with the error indicator; having a 0 return value indicates an error. Yet some other C entry points overload things even more, and 0 return value may mean a successful call returning a valid value 0, as well as an error condition; in the case of a 0 return value one should call WinGetLastError() API to distinguish a successful call from a failing one. By convention, all the calls to OS/2 API should indicate their failures by resetting $^E. All the Perl-accessible functions which call OS/2 API may be broken into two classes: some die() when an API error is encountered, the others report the error via a false return value (of course, this does not concern Perl-accessible functions which I<expect> a failure of the OS/2 API call, having some workarounds coded). Obviously, in the situation of the last type of the signature of an OS/2 API, it is much more convenient for the users if the failure is indicated by die()ing: one does not need to check $^E to know that something went wrong. If, however, this solution is not desirable for some reason, the code in question should reset $^E to 0 before making this OS/2 API call, so that the caller of this Perl-accessible function has a chance to distinguish a success-but-0-return value from a failure. (One may return undef as an alternative way of reporting an error.) The macros to simplify this type of error propagation are =over =item C Returns true on error, sets $^E. Expects expr() to be a call of C-style API. =item C Returns true on error, sets $^E. Expects expr() to be a call of C-style API. =item C Returns C, sets $^E from WinGetLastError() if C is false. =item C Returns C, sets $^E from WinGetLastError() if C is false, and die()s if C and $^E are true. The message to die is the concatenated strings C and C, separated by C<": "> from the contents of $^E. =item C Sets C to the return value of WinGetLastError().
=item C Sets C to the return value of WinGetLastError(), and sets $^E to the corresponding value. =item C Sets C to C, and sets $^E to the corresponding value. =back =item Loading DLLs and ordinals in DLLs Some DLLs are only present in some versions of OS/2, or in some configurations of OS/2. Some exported entry points are present only in DLLs shipped with some versions of OS/2. If these DLLs and entry points were linked directly for a Perl executable/DLL or from a Perl extensions, this binary would work only with the specified versions/setups. Even if these entry points were not needed, the I of the executable (or DLL) would fail. For example, many newer useful APIs are not present in OS/2 v2; many PM-related APIs require DLLs not available on floppy-boot setup. To make these calls fail I, one should call these API via a dynamic linking API. There is a subsystem in Perl to simplify such type of calls. A large number of entry points available for such linking is provided (see C - and also C - in F). These ordinals can be accessed via the APIs: CallORD(), DeclFuncByORD(), DeclVoidFuncByORD(), DeclOSFuncByORD(), DeclWinFuncByORD(), AssignFuncPByORD(), DeclWinFuncByORD_CACHE(), DeclWinFuncByORD_CACHE_survive(), DeclWinFuncByORD_CACHE_resetError_survive(), DeclWinFunc_CACHE(), DeclWinFunc_CACHE_resetError(), DeclWinFunc_CACHE_survive(), DeclWinFunc_CACHE_resetError_survive() See the header files and the C code in the supplied OS/2-related modules for the details on usage of these functions. Some of these functions also combine dynaloading semantic with the error-propagation semantic discussed above. =back =head1 Perl flavors Because of idiosyncrasies of OS/2 one cannot have all the eggs in the same basket (though EMX environment tries hard to overcome this limitations, so the situation may somehow improve). There are 4 executables for Perl provided by the distribution: =head2 F The main workhorse. 
This is a chimera executable: it is compiled as an C-style executable, but is linked with C-style dynamic library F, and with dynamic CRT DLL. This executable is a VIO application. It can load perl dynamic extensions, and it can fork(). B Keep in mind that fork() is needed to open a pipe to yourself. =head2 F This is a statically linked C-style executable. It cannot load dynamic Perl extensions. The executable supplied in binary distributions has a lot of extensions prebuilt, thus the above restriction is important only if you use custom-built extensions. This executable is a VIO application. I The friends locked into C world would appreciate the fact that this executable runs under DOS, Win0.3*, Win0.95 and WinNT with an appropriate extender. See L<"Other OSes">. =head2 F This is the same executable as F, but it is a PM application. B Usually (unless explicitly redirected during the startup) STDIN, STDERR, and STDOUT of a PM application are redirected to F. However, it is possible to I them if you start C from a PM program which emulates a console window, like I of Emacs or EPM. Thus it I to use Perl debugger (see L) to debug your PM application (but beware of the message loop lockups - this will not work if you have a message queue to serve, unless you hook the serving into the getc() function of the debugger). Another way to see the output of a PM program is to run it as pm_prog args 2>&1 | cat - with a shell I from F, so that it does not create a link between a VIO session and the session of C. (Such a link closes the VIO window.) E.g., this works with F - or with Perl! open P, 'pm_prog args 2>&1 |' or die; print while
<P>
; The flavor F is required if you want to start your program without a VIO window present, but not Ced (run C for more info). Very useful for extensions which use PM, like C or C. Note also that the differences between PM and VIO executables are only in the I behaviour. One can start I executable in I kind of session by using the arguments C, C or C switches of the command C (of F or a similar shell). Alternatively, one can use the numeric first argument of the C Perl function (see L). =head2 F This is an C-style executable which is dynamically linked to F and CRT DLL. I know no advantages of this executable over C, but it cannot fork() at all. Well, one advantage is that the build process is not so convoluted as with C. It is a VIO application. =head2 Why strange names? Since Perl processes the C<#!>-line (cf. L, L, L), it should know when a program I. There is some naming convention which allows Perl to distinguish correct lines from wrong ones. The above names are almost the only names allowed by this convention which do not contain digits (which have absolutely different semantics). =head2 Why dynamic linking? Well, having several executables dynamically linked to the same huge library has its advantages, but this would not substantiate the additional work to make it compile. The reason is the complicated-to-developers but very quick and convenient-to-users "hard" dynamic linking used by OS/2. There are two distinctive features of the dyna-linking model of OS/2: first, all the references to external functions are resolved at the compile time; second, there is no runtime fixup of the DLLs after they are loaded into memory. The first feature is an enormous advantage over other models: it avoids conflicts when several DLLs used by an application export entries with the same name. In such cases "other" models of dyna-linking just choose between these two entry points using some random criterion - with predictable disasters as results. 
But it is the second feature which requires the build of F. The address tables of DLLs are patched only once, when they are loaded. The addresses of the entry points into DLLs are guaranteed to be the same for all the programs which use the same DLL. This removes the runtime fixup - once a DLL is loaded, its code is read-only. While this allows some (significant?) performance advantages, this makes life much harder for developers, since the above scheme makes it impossible for a DLL to be "linked" to a symbol in the F<.EXE> file. Indeed, this would need a DLL to have different relocation tables for the (different) executables which use this DLL. However, a dynamically loaded Perl extension is forced to use some symbols from the perl executable, e.g., to know how to find the arguments to the functions: the arguments live on the perl internal evaluation stack. The solution is to put the main code of the interpreter into a DLL, and make the F<.EXE> file just load this DLL into memory and supply command-arguments. The extension DLL cannot link to symbols in F<.EXE>, but it has no problem linking to symbols in the F<.DLL>. This I<greatly> increases the load time for the application (as well as complexity of the compilation). Since the interpreter is in a DLL, the C RTL is basically forced to reside in a DLL as well (otherwise extensions would not be able to use CRT). There are some advantages if you use different flavors of perl, such as running F and F simultaneously: they share the memory of F. B. There is one additional effect which makes DLLs more wasteful: DLLs are loaded in the shared memory region, which is a scarce resource given the 512M barrier of the "standard" OS/2 virtual memory.
The code of F<.EXE> files is also shared by all the processes which use the particular F<.EXE>, but they are "shared in the private address space of the process"; this is possible because the address at which different sections of the F<.EXE> file are loaded is decided at compile-time, thus all the processes have these sections loaded at the same addresses, and no fixup of internal links inside the F<.EXE> is needed. Since DLLs may be loaded at run time, to have the same mechanism for DLLs one needs to have the address range of I DLLs in the system to be available I which did not load a particular DLL yet. This is why the DLLs are mapped to the shared memory region. =head2 Why chimera build? The current EMX environment does not allow DLLs compiled using Unixish C format to export symbols for data (or at least some types of data). This forces C-style compile of F. The current EMX environment does not allow F<.EXE> files compiled in C format to fork(). fork() is needed for exactly three Perl operations: =over 4 =item * explicit fork() in the script, =item * C =item * C, in other words, opening pipes to itself. =back While these operations are not questions of life and death, they are needed for a lot of useful scripts. This forces C-style compile of F. =head1 ENVIRONMENT Here we list environment variables which are either OS/2- and DOS- and Win*-specific, or are more important under OS/2 than under other OSes. =head2 C Specific for EMX port. Should have the form path1;path2 or path1 path2 If the beginning of some prebuilt path matches F, it is substituted with F. Should be used if the perl library is moved from the default location in preference to C, since this would not leave wrong entries in @INC.
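The prefix substitution described above can be sketched in plain Perl. This is only an illustration of the documented rule, not the code Perl actually runs: the helper name C<apply_prefix> is hypothetical, and the real implementation inside the OS/2 port may treat case or slashes differently.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Perl): applies a setting of the form
# "path1;path2" or "path1 path2" to a list of prebuilt paths.
sub apply_prefix {
    my ($setting, @paths) = @_;

    # Split the setting into the prefix to look for and its replacement.
    my ($from, $to) = split /[; ]/, $setting, 2;
    return @paths unless defined $to && length $from;

    my $len = length $from;
    return map {
        my $path = $_;
        # Rewrite only the paths that begin with path1 (exact comparison,
        # which is an assumption made for this sketch).
        $path = $to . substr($path, $len)
            if substr($path, 0, $len) eq $from;
        $path;
    } @paths;
}

my @inc = apply_prefix(
    'f:/perllib/lib;h:/opt/gnu',
    'f:/perllib/lib/5.00553/os2',
    'f:/perllib/lib/site_perl/5.00553',
    '.',
);
print "$_\n" for @inc;
```

Entries that do not start with the first path, such as F<.>, pass through unchanged, which is why moving the library this way leaves no stale entries behind.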
For example, if the compiled version of perl looks for @INC in F<f:/perllib/lib>, and you want to install the library in F<h:/opt/gnu>, do set PERLLIB_PREFIX=f:/perllib/lib;h:/opt/gnu This will cause Perl with the prebuilt @INC of f:/perllib/lib/5.00553/os2 f:/perllib/lib/5.00553 f:/perllib/lib/site_perl/5.00553/os2 f:/perllib/lib/site_perl/5.00553 . to use the following @INC: h:/opt/gnu/5.00553/os2 h:/opt/gnu/5.00553 h:/opt/gnu/site_perl/5.00553/os2 h:/opt/gnu/site_perl/5.00553 . =head2 C If 0, perl ignores setlocale() failing. May be useful with some strange Is. =head2 C If 0, perl would not warn in case of unwarranted free(). With older perls this might be useful in conjunction with the module DB_File, which was buggy when dynamically linked and OMF-built. Should not be set with newer Perls, since this may hide some I problems. =head2 C Specific for EMX port. Gives the directory part of the location for F. =head2 C Specific for EMX port. Since L is present in EMX, but is not functional, it is emulated by perl. To disable the emulations, set environment variable C. =head2 C or C Specific for EMX port. Used as storage place for temporary files. =head1 Evolution Here we list major changes which could take you by surprise. =head2 Text-mode filehandles Starting from version 5.8, Perl uses a builtin translation layer for text-mode files. This replaces the efficient well-tested EMX layer by some code which should be best characterized as a "quick hack". In addition to possible bugs and an inability to follow changes to the translation policy with off/on switches of TERMIO translation, this introduces a serious incompatible change: previously, sysread() on text-mode filehandles would go through the translation layer; now it does not. =head2 Priorities C<setpriority> and C<getpriority> are not compatible with earlier ports by Andreas Kaiser. See C<"setpriority, getpriority">.
=head2 DLL name mangling: pre 5.6.2 With the release 5.003_01 the dynamically loadable libraries should be rebuilt when a different version of Perl is compiled. In particular, DLLs (including F) are now created with the names which contain a checksum, thus allowing workaround for OS/2 scheme of caching DLLs. It may be possible to code a simple workaround which would =over =item * find the old DLLs looking through the old @INC; =item * mangle the names according to the scheme of new perl and copy the DLLs to these names; =item * edit the internal C tables of DLL to reflect the change of the name (probably not needed for Perl extension DLLs, since the internally coded names are not used for "specific" DLLs, they used only for "global" DLLs). =item * edit the internal C tables and change the name of the "old" F to the "new" F. =back =head2 DLL name mangling: 5.6.2 and beyond In fact mangling of I DLLs was done due to misunderstanding of the OS/2 dynaloading model. OS/2 (effectively) maintains two different tables of loaded DLL: =over =item Global DLLs those loaded by the base name from C; including those associated at link time; =item specific DLLs loaded by the full name. =back When resolving a request for a global DLL, the table of already-loaded specific DLLs is (effectively) ignored; moreover, specific DLLs are I loaded from the prescribed path. There is/was a minor twist which makes this scheme fragile: what to do with DLLs loaded from =over =item C and C (which depend on the process) =item F<.> from C which I depends on the process (although C is the same for all the processes). =back Unless C is set to C (and the kernel is after 2000/09/01), such DLLs are considered to be global. When loading a global DLL it is first looked in the table of already-loaded global DLLs. Because of this the fact that one executable loaded a DLL from C and C, or F<.> from C may affect I DLL is loaded when I executable requests a DLL with the same name. 
I is the reason for version-specific mangling of the DLL name for perl DLL. Since the Perl extension DLLs are always loaded with the full path, there is no need to mangle their names in a version-specific ways: their directory already reflects the corresponding version of perl, and @INC takes into account binary compatibility with older version. Starting from C<5.6.2> the name mangling scheme is fixed to be the same as for Perl 5.005_53 (same as in a popular binary release). Thus new Perls will be able to I of old extension DLLs if @INC allows finding their directories. However, this still does not guarantee that these DLL may be loaded. The reason is the mangling of the name of the I. And since the extension DLLs link with the Perl DLL, extension DLLs for older versions would load an older Perl DLL, and would most probably segfault (since the data in this DLL is not properly initialized). There is a partial workaround (which can be made complete with newer OS/2 kernels): create a forwarder DLL with the same name as the DLL of the older version of Perl, which forwards the entry points to the newer Perl's DLL. Make this DLL accessible on (say) the C of the new Perl executable. When the new executable accesses old Perl's extension DLLs, they would request the old Perl's DLL by name, get the forwarder instead, so effectively will link with the currently running (new) Perl DLL. This may break in two ways: =over =item * Old perl executable is started when a new executable is running has loaded an extension compiled for the old executable (ouph!). In this case the old executable will get a forwarder DLL instead of the old perl DLL, so would link with the new perl DLL. While not directly fatal, it will behave the same as new executable. This beats the whole purpose of explicitly starting an old executable. =item * A new executable loads an extension compiled for the old executable when an old perl executable is running. 
In this case the extension will not pick up the forwarder - with fatal results. =back With support for C this may be circumvented - unless one of DLLs is started from F<.> from C (I do not know whether C affects this case). B. Unless newer kernels allow F<.> in C (older do not), this mess cannot be completely cleaned. (It turns out that as of the beginning of 2002, F<.> is not allowed, but F<.\.> is - and it has the same effect.) B. C, C and C are not environment variables, although F emulates them on C lines. From Perl they may be accessed by L and L. =head2 DLL forwarder generation Assume that the old DLL is named F (as is one for 5.005_53), and the new version is 5.6.1. Create a file F with LIBRARY 'perlE0AC' INITINSTANCE TERMINSTANCE DESCRIPTION '@#perl5-porters@perl.org:5.006001#@ Perl module for 5.00553 -> Perl 5.6.1 forwarder' CODE LOADONCALL DATA LOADONCALL NONSHARED MULTIPLE EXPORTS modifying the versions/names as needed. Run perl -wnle "next if 0../EXPORTS/; print qq( \"$1\") if /\"(\w+)\"/" perl5.def >lst in the Perl build directory (to make the DLL smaller replace perl5.def with the definition file for the older version of Perl if present). cat perl5shim.def-leader lst >perl5shim.def gcc -Zomf -Zdll -o perlE0AC.dll perl5shim.def -s -llibperl (ignore multiple C). =head2 Threading As of release 5.003_01 perl is linked to multithreaded C RTL DLL. If perl itself is not compiled multithread-enabled, so will not be perl's malloc(). However, extensions may use multiple thread on their own risk. This was needed to compile C for XFree86-OS/2 out-of-the-box, and link with DLLs for other useful libraries, which typically are compiled with C<-Zmt -Zcrtdll>. =head2 Calls to external programs Due to a popular demand the perl external program calling has been changed wrt Andreas Kaiser's port. I perl needs to call an external program I, the F will be called, or whatever is the override, see L<"PERL_SH_DIR">. 
This means that you need to get some copy of a F as well (I use one from pdksh). The path F above is set up automatically during the build to a correct value on the builder machine, but is overridable at runtime, B a consensus on C was that perl should use one non-overridable shell per platform. The obvious choices for OS/2 are F and F. Having perl build itself would be impossible with F as a shell, thus I picked up C. This assures almost 100% compatibility with the scripts coming from *nix. As an added benefit this works as well under DOS if you use DOS-enabled port of pdksh (see L). B currently F of pdksh calls external programs via fork()/exec(), and there is I<no> functioning exec() on OS/2. exec() is emulated by EMX by an asynchronous call while the caller waits for child completion (to pretend that the C did not change). This means that 1 I<extra> copy of F is made active via fork()/exec(), which may lead to some resources taken from the system (even if we do not count extra work needed for fork()ing). Note that this is a lesser issue now that we do not spawn F unless needed (metachars found). One can always start F explicitly via system 'cmd', '/c', 'mycmd', 'arg1', 'arg2', ... If you need to use F, and do not want to hand-edit thousands of your scripts, the long-term solution proposed on p5-p is to have a directive use OS2::Cmd; which will override system(), exec(), C<``>, and C. With current perl you may override only system(), readpipe() - the explicit version of C<``>, and maybe exec(). The code will substitute the one-argument call to system() by C. If you have some working code for C<OS2::Cmd>, please send it to me, I will include it into distribution. I have no need for such a module, so cannot test it. For the details of the current situation with calling external programs, see L<"Starting OS/2 (and DOS) programs under Perl">. Let us mention a couple of features: =over 4 =item * External scripts may be called by their basename.
Perl will try the same extensions as when processing B<-S> command-line switch. =item * External scripts starting with C<#!> or C will be executed directly, without calling the shell, by calling the program specified on the rest of the first line. =back =head2 Memory allocation Perl uses its own malloc() under OS/2 - interpreters are usually malloc-bound for speed, but perl is not, since its malloc is lightning-fast. Perl-memory-usage-tuned benchmarks show that Perl's malloc is 5 times quicker than EMX one. I do not have convincing data about memory footprint, but a (pretty random) benchmark showed that Perl's one is 5% better. Combination of perl's malloc() and rigid DLL name resolution creates a special problem with library functions which expect their return value to be free()d by system's free(). To facilitate extensions which need to call such functions, system memory-allocation functions are still available with the prefix C added. (Currently only DLL perl has this, it should propagate to F shortly.) =head2 Threads One can build perl with thread support enabled by providing C<-D usethreads> option to F. Currently OS/2 support of threads is very preliminary. Most notable problems: =over 4 =item C may have a race condition (but probably does not due to edge-triggered nature of OS/2 Event semaphores). (Needs a reimplementation (in terms of chaining waiting threads, with the linked list stored in per-thread structure?)?) =item F has a couple of static variables used in OS/2-specific functions. (Need to be moved to per-thread structure, or serialized?) =back Note that these problems should not discourage experimenting, since they have a low probability of affecting small programs. =head1 BUGS This description is not updated often (since 5.6.1?), see F<./os2/Changes> for more info. =cut OS/2 extensions ~~~~~~~~~~~~~~~ I include 3 extensions by Andreas Kaiser, OS2::REXX, OS2::UPM, and OS2::FTP, into my ftp directory, mirrored on CPAN. 
I made some minor changes needed to compile them by standard tools. I cannot test UPM and FTP, so I will appreciate your feedback. Other extensions there are OS2::ExtAttr, OS2::PrfDB for tied access to EAs and .INI files - and maybe some other extensions at the time you read it. Note that OS2 perl defines 2 pseudo-extension functions OS2::Copy::copy and DynaLoader::mod2fname (many more now, see L). The -R switch of older perl is deprecated. If you need to call a REXX code which needs access to variables, include the call into a REXX compartment created by REXX_call {...block...}; Two new functions are supported by REXX code, REXX_eval 'string'; REXX_eval_with 'string', REXX_function_name => \&perl_sub_reference; If you have some other extensions you want to share, send the code to me. At least two are available: tied access to EA's, and tied access to system databases. =head1 AUTHOR Ilya Zakharevich, cpan@ilyaz.org =head1 SEE ALSO perl(1). =cut
=head1 NAME perlmodinstall - Installing CPAN Modules =head1 DESCRIPTION You can think of a module as the fundamental unit of reusable Perl code; see L for details. Whenever anyone creates a chunk of Perl code that they think will be useful to the world, they register as a Perl developer at http://www.cpan.org/modules/04pause.html so that they can then upload their code to the CPAN. The CPAN is the Comprehensive Perl Archive Network and can be accessed at http://www.cpan.org/ , and searched at http://search.cpan.org/ . This documentation is for people who want to download CPAN modules and install them on their own computer. =head2 PREAMBLE First, are you sure that the module isn't already on your system? Try C<perl -MFoo -e 1>. (Replace "Foo" with the name of the module; for instance, C<perl -MCGI::Carp -e 1>.) If you don't see an error message, you have the module. (If you do see an error message, it's still possible you have the module, but that it's not in your path, which you can display with C<perl -e "print qq(@INC)">.)
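The same check can be made from inside a script. The following is a minimal sketch, assuming a hypothetical helper named C<module_installed>; it tries to load the module at run time and then prints the search path, so it reports "not found" both when the module is absent and when it is present but outside the path.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the named module can be loaded
# from the current @INC, and 0 otherwise.
sub module_installed {
    my ($module) = @_;

    # Turn Foo::Bar into Foo/Bar.pm, the form require() expects for a
    # file name held in a variable.
    (my $file = "$module.pm") =~ s{::}{/}g;

    return eval { require $file; 1 } ? 1 : 0;
}

# File::Spec ships with Perl; the second name is a made-up placeholder.
for my $name (qw(File::Spec No::Such::Module::Here)) {
    printf "%-24s %s\n", $name,
        module_installed($name) ? 'found' : 'not found';
}

# Show the directories that were searched.
print "$_\n" for @INC;
```

Because the load happens inside C<eval>, a missing module does not abort the script, which makes the helper convenient for checking several modules in one run.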
For the remainder of this document, we'll assume that you really honestly truly lack an installed module, but have found it on the CPAN. So now you have a file ending in .tar.gz (or, less often, .zip). You know there's a tasty module inside. There are four steps you must now take: =over 5 =item B<DECOMPRESS> the file =item B<UNPACK> the file into a directory =item B<BUILD> the module (sometimes unnecessary) =item B<INSTALL> the module. =back Here's how to perform each step for each operating system. This is I<not> a substitute for reading the README and INSTALL files that might have come with your module! Also note that these instructions are tailored for installing the module into your system's repository of Perl modules, but you can install modules into any directory you wish. For instance, where I say C<perl Makefile.PL>, you can substitute C<perl Makefile.PL PREFIX=/my/perl_directory> to install the modules into C</my/perl_directory>. Then you can use the modules from your Perl programs with C or sometimes just C. If you're on a system that requires superuser/root access to install modules into the directories you see when you type C, you'll want to install them into a local directory (such as your home directory) and use this approach. =over 4 =item * B You can use Andreas Koenig's CPAN module ( http://www.cpan.org/modules/by-module/CPAN ) to automate the following steps, from DECOMPRESS through INSTALL. A. DECOMPRESS Decompress the file with C<gzip -d yourmodule.tar.gz>. You can get gzip from ftp://prep.ai.mit.edu/pub/gnu/ Or, you can combine this step with the next to save disk space: gzip -dc yourmodule.tar.gz | tar -xof - B. UNPACK Unpack the result with C<tar -xof yourmodule.tar>. C. BUILD Go into the newly-created directory and type: perl Makefile.PL make test or perl Makefile.PL PREFIX=/my/perl_directory to install it locally. (Remember that if you do this, you'll have to put C near the top of the program that is to use this module.) D. INSTALL While still in that directory, type: make install Make sure you have the appropriate permissions to install the module in your Perl 5 library directory. Often, you'll need to be root.
That's all you need to do on Unix systems with dynamic linking. Most Unix systems have dynamic linking. If yours doesn't, or if for another reason you have a statically-linked perl, B the module requires compilation, you'll need to build a new Perl binary that includes the module. Again, you'll probably need to be root. =item * B First, type C from a shell and see whether ActiveState's PPM repository has your module. If so, you can install it with C and you won't have to bother with any of the other steps here. You might be able to use the CPAN instructions from the "Unix or Linux" section above as well; give it a try. Otherwise, you'll have to follow the steps below. A. DECOMPRESS You can use the shareware Winzip ( http://www.winzip.com ) to decompress and unpack modules. B. UNPACK If you used WinZip, this was already done for you. C. BUILD You'll need the C utility, available at http://download.microsoft.com/download/vc15/Patch/1.52/W95/EN-US/nmake15.exe or dmake, available on CPAN. http://search.cpan.org/dist/dmake/ Does the module require compilation (i.e. does it have files that end in .xs, .c, .h, .y, .cc, .cxx, or .C)? If it does, life is now officially tough for you, because you have to compile the module yourself (no easy feat on Windows). You'll need a compiler such as Visual C++. Alternatively, you can download a pre-built PPM package from ActiveState. http://aspn.activestate.com/ASPN/Downloads/ActivePerl/PPM/ Go into the newly-created directory and type: perl Makefile.PL nmake test D. INSTALL While still in that directory, type: nmake install =item * B A. DECOMPRESS First, make sure you have the latest B distribution ( http://www.cpan.org/authors/id/CNANDOR/ ), which has utilities for doing all of the steps. Read the cpan-mac directions carefully and install it. If you choose not to use cpan-mac for some reason, there are alternatives listed here. 
After installing cpan-mac, drop the module archive on the B droplet, which will decompress and unpack for you. B, you can either use the shareware B program ( http://my.smithmicro.com/mac/stuffit/ ) or the freeware B program ( http://persephone.cps.unizar.es/general/gente/spd/gzip/gzip.html ). B. UNPACK If you're using untarzipme or StuffIt, the archive should be extracted now. B, you can use the freeware B or I ( http://hyperarchive.lcs.mit.edu/HyperArchive/Archive/cmp/ ). C. BUILD Check the contents of the distribution. Read the module's documentation, looking for reasons why you might have trouble using it with MacPerl. Look for F<.xs> and F<.c> files, which normally denote that the distribution must be compiled, and you cannot install it "out of the box." (See L<"PORTABILITY">.) D. INSTALL If you are using cpan-mac, just drop the folder on the B droplet, and use the module. B, if you aren't using cpan-mac, do some manual labor. Make sure the newlines for the modules are in Mac format, not Unix format. If they are not then you might have decompressed them incorrectly. Check your decompression and unpacking utilities settings to make sure they are translating text files properly. As a last resort, you can use the perl one-liner: perl -i.bak -pe 's/(?:\015)?\012/\015/g' on the source files. Then move the files (probably just the F<.pm> files, though there may be some additional ones, too; check the module documentation) to their final destination: This will most likely be in C<$ENV{MACPERL}site_lib:> (i.e., C). You can add new paths to the default C<@INC> in the Preferences menu item in the MacPerl application (C<$ENV{MACPERL}site_lib:> is added automagically). Create whatever directory structures are required (i.e., for C, create C<$ENV{MACPERL}site_lib:Some:> and put C in that directory). 
Then run the following script (or something like it): #!perl -w use AutoSplit; my $dir = "${MACPERL}site_perl"; autosplit("$dir:Some:Module.pm", "$dir:auto", 0, 1, 1); =item * B A. DECOMPRESS djtarx ( ftp://ftp.delorie.com/pub/djgpp/current/v2/ ) will both uncompress and unpack. B. UNPACK See above. C. BUILD Go into the newly-created directory and type: perl Makefile.PL make test You will need the packages mentioned in F in the Perl distribution. D. INSTALL While still in that directory, type: make install You will need the packages mentioned in F in the Perl distribution. =item * B Get the EMX development suite and gzip/tar, from either Hobbes ( http://hobbes.nmsu.edu ) or Leo ( http://www.leo.org ), and then follow the instructions for Unix. =item * B When downloading from CPAN, save your file with a C<.tgz> extension instead of C<.tar.gz>. All other periods in the filename should be replaced with underscores. For example, C should be downloaded as C. A. DECOMPRESS Type gzip -d Your-Module.tgz or, for zipped modules, type unzip Your-Module.zip Executables for gzip, zip, and VMStar: http://www.hp.com/go/openvms/freeware/ and their source code: http://www.fsf.org/order/ftp.html Note that GNU's gzip/gunzip is not the same as Info-ZIP's zip/unzip package. The former is a simple compression tool; the latter permits creation of multi-file archives. B. UNPACK If you're using VMStar: VMStar xf Your-Module.tar Or, if you're fond of VMS command syntax: tar/extract/verbose Your_Module.tar C. BUILD Make sure you have MMS (from Digital) or the freeware MMK ( available from MadGoat at http://www.madgoat.com ). Then type this to create the DESCRIP.MMS for the module: perl Makefile.PL Now you're ready to build: mms test Substitute C for C above if you're using MMK. D. INSTALL Type mms install Substitute C for C above if you're using MMK. =item * B, Introduce the F<.tar.gz> file into an HFS as binary; don't translate from ASCII to EBCDIC. A. 
DECOMPRESS

Decompress the file with C<gzip -d yourmodule.tar.gz>

You can get gzip from http://www.s390.ibm.com/products/oe/bpxqp1.html

B. UNPACK

Unpack the result with

     pax -o to=IBM-1047,from=ISO8859-1 -r < yourmodule.tar

The BUILD and INSTALL steps are identical to those for Unix. Some modules generate Makefiles that work better with GNU make, which is available from http://www.mks.com/s390/gnu/

=back

=head1 PORTABILITY

Note that not all modules will work on all platforms. See L<perlport> for more information on portability issues. Read the documentation to see if the module will work on your system. There are basically three categories of modules that will not work "out of the box" with all platforms (with some possibility of overlap):

=over 4

=item *

B<Those that should, but don't.> These need to be fixed; consider contacting the author and possibly writing a patch.

=item *

B<Those that need to be compiled, where the target platform doesn't have compilers readily available.> (These modules contain F<.xs> or F<.c> files, usually.) You might be able to find existing binaries on the CPAN or elsewhere, or you might want to try getting compilers and building it yourself, and then release the binary for other poor souls to use.

=item *

B<Those that are targeted at a specific platform.> (Such as the Win32:: modules.) If the module is targeted specifically at a platform other than yours, you're out of luck, most likely.

=back

Check the CPAN Testers if a module should work with your platform but it doesn't behave as you'd expect, or you aren't sure whether or not a module will work under your platform. If the module you want isn't listed there, you can test it yourself and let CPAN Testers know, you can join CPAN Testers, or you can request it be tested.

    http://testers.cpan.org/

=head1 HEY

If you have any suggested changes for this page, let me know. Please don't send me mail asking for help on how to install your modules. There are too many modules, and too few Orwants, for me to be able to answer or even acknowledge all your questions. Contact the module author instead, or post to comp.lang.perl.modules, or ask someone familiar with Perl on your operating system.
=head1 AUTHOR

Jon Orwant

orwant@medita.mit.edu

with invaluable help from Chris Nandor, and valuable help from Brandon Allbery, Charles Bailey, Graham Barr, Dominic Dunlop, Jarkko Hietaniemi, Ben Holzman, Tom Horsley, Nick Ing-Simmons, Tuomas J. Lukka, Laszlo Molnar, Alan Olsen, Peter Prymmer, Gurusamy Sarathy, Christoph Spalinger, Dan Sugalski, Larry Virden, and Ilya Zakharevich.

First version July 22, 1998; last revised November 21, 2001.

=head1 COPYRIGHT

Copyright (C) 1998, 2002, 2003 Jon Orwant. All Rights Reserved.

This document may be distributed under the same terms as Perl itself.

=head1 NAME

perlstyle - Perl style guide

=head1 DESCRIPTION

Each programmer will, of course, have his or her own preferences in regard to formatting, but there are some general guidelines that will make your programs easier to read, understand, and maintain.

The most important thing is to run your programs under the B<-w> flag at all times. You may turn it off explicitly for particular portions of code via the C<no warnings> pragma or the C<$^W> variable if you must. You should also always run under C<use strict> or know the reason why not. The C<use sigtrap> and even C<use diagnostics> pragmas may also prove useful.

Regarding aesthetics of code layout, about the only thing Larry cares strongly about is that the closing curly bracket of a multi-line BLOCK should line up with the keyword that started the construct. Beyond that, he has other preferences that aren't so strong:

=over 4

=item *

4-column indent.

=item *

Opening curly on same line as keyword, if possible, otherwise line up.

=item *

Space before the opening curly of a multi-line BLOCK.

=item *

One-line BLOCK may be put on one line, including curlies.

=item *

No space before the semicolon.

=item *

Semicolon omitted in "short" one-line BLOCK.

=item *

Space around most operators.

=item *

Space around a "complex" subscript (inside brackets).

=item *

Blank lines between chunks that do different things.
=item *

Uncuddled elses.

=item *

No space between function name and its opening parenthesis.

=item *

Space after each comma.

=item *

Long lines broken after an operator (except C<and> and C<or>).

=item *

Space after last parenthesis matching on current line.

=item *

Line up corresponding items vertically.

=item *

Omit redundant punctuation as long as clarity doesn't suffer.

=back

Larry has his reasons for each of these things, but he doesn't claim that everyone else's mind works the same as his does.

Here are some other more substantive style issues to think about:

=over 4

=item *

Just because you I<CAN> do something a particular way doesn't mean that you I<SHOULD> do it that way. Perl is designed to give you several ways to do anything, so consider picking the most readable one. For instance

    open(FOO,$foo) || die "Can't open $foo: $!";

is better than

    die "Can't open $foo: $!" unless open(FOO,$foo);

because the second way hides the main point of the statement in a modifier. On the other hand

    print "Starting analysis\n" if $verbose;

is better than

    $verbose && print "Starting analysis\n";

because the main point isn't whether the user typed B<-v> or not.

Similarly, just because an operator lets you assume default arguments doesn't mean that you have to make use of the defaults. The defaults are there for lazy systems programmers writing one-shot programs. If you want your program to be readable, consider supplying the argument.

Along the same lines, just because you I<CAN> omit parentheses in many places doesn't mean that you ought to:

    return print reverse sort num values %array;
    return print(reverse(sort num (values(%array))));

When in doubt, parenthesize. At the very least it will let some poor schmuck bounce on the % key in B<vi>.

Even if you aren't in doubt, consider the mental welfare of the person who has to maintain the code after you, and who will probably put parentheses in the wrong place.
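To make the point about parentheses concrete, the following sketch evaluates both spellings of the statement above and shows that they yield the same list. The C<num> comparison routine and the contents of C<%array> are made up for illustration; they are assumptions, not part of the original example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical data and comparison routine, for illustration only.
my %array = (a => 10, b => 2, c => 33);
sub num { $a <=> $b }

# Terse form: the reader has to know how each list operator binds.
my @terse = reverse sort num values %array;

# Parenthesized form: same result, but the nesting is explicit.
my @explicit = reverse(sort num (values(%array)));

print "@terse\n";      # 33 10 2
print "@explicit\n";   # 33 10 2
```

Both forms compile to the same operations; the second simply spares the maintainer from working out the nesting by hand.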
=item * Don't go through silly contortions to exit a loop at the top or the bottom, when Perl provides the C operator so you can exit in the middle. Just "outdent" it a little to make it more visible: LINE: for (;;) { statements; last LINE if $foo; next LINE if /^#/; statements; } =item * Don't be afraid to use loop labels--they're there to enhance readability as well as to allow multilevel loop breaks. See the previous example. =item * Avoid using C (or C) or `backticks` in a void context, that is, when you just throw away their return values. Those functions all have return values, so use them. Otherwise use a C loop or the C function instead. =item * For portability, when using features that may not be implemented on every machine, test the construct in an eval to see if it fails. If you know what version or patchlevel a particular feature was implemented, you can test C<$]> (C<$PERL_VERSION> in C) to see if it will be there. The C module will also let you interrogate values determined by the B program when Perl was installed. =item * Choose mnemonic identifiers. If you can't remember what mnemonic means, you've got a problem. =item * While short identifiers like C<$gotit> are probably ok, use underscores to separate words in longer identifiers. It is generally easier to read C<$var_names_like_this> than C<$VarNamesLikeThis>, especially for non-native speakers of English. It's also a simple rule that works consistently with C. Package names are sometimes an exception to this rule. Perl informally reserves lowercase module names for "pragma" modules like C and C. Other modules should begin with a capital letter and use mixed case, but probably without underscores due to limitations in primitive file systems' representations of module names as files that must fit into a few sparse bytes. =item * You may find it helpful to use letter case to indicate the scope or nature of a variable. For example: $ALL_CAPS_HERE constants only (beware clashes with perl vars!) 
    $Some_Caps_Here  package-wide global/static
    $no_caps_here    function scope my() or local() variables

Function and method names seem to work best as all lowercase. E.g., C<$obj-E<gt>as_string()>.

You can use a leading underscore to indicate that a variable or function should not be used outside the package that defined it.

=item *

If you have a really hairy regular expression, use the C</x> modifier and put in some whitespace to make it look a little less like line noise. Don't use slash as a delimiter when your regexp has slashes or backslashes.

=item *

Use the new C<and> and C<or> operators to avoid having to parenthesize list operators so much, and to reduce the incidence of punctuation operators like C<&&> and C<||>. Call your subroutines as if they were functions or list operators to avoid excessive ampersands and parentheses.

=item *

Use here documents instead of repeated C<print()> statements.

=item *

Line up corresponding things vertically, especially if it'd be too long to fit on one line anyway.

    $IDX = $ST_MTIME;
    $IDX = $ST_ATIME       if $opt_u;
    $IDX = $ST_CTIME       if $opt_c;
    $IDX = $ST_SIZE        if $opt_s;

    mkdir $tmpdir, 0700 or die "can't mkdir $tmpdir: $!";
    chdir($tmpdir)      or die "can't chdir $tmpdir: $!";
    mkdir 'tmp',   0777 or die "can't mkdir $tmpdir/tmp: $!";

=item *

Always check the return codes of system calls. Good error messages should go to C<STDERR>, include which program caused the problem, what the failed system call and arguments were, and (VERY IMPORTANT) should contain the standard system error message for what went wrong. Here's a simple but sufficient example:

    opendir(D, $dir)     or die "can't opendir $dir: $!";

=item *

Line up your transliterations when it makes sense:

    tr [abc]
       [xyz];

=item *

Think about reusability. Why waste brainpower on a one-shot when you might want to do something like it again? Consider generalizing your code. Consider writing a module or object class. Consider making your code run cleanly with C<use strict> and C<use warnings> (or B<-w>) in effect. Consider giving away your code.
Consider changing your whole world view. Consider... oh, never mind.

=item *

Try to document your code and use Pod formatting in a consistent way. Here are commonly expected conventions:

=over 4

=item *

use C<CE<lt>E<gt>> for function, variable and module names (and more generally anything that can be considered part of code, like filehandles or specific values). Note that function names are considered more readable with parentheses after their name, that is C<function()>.

=item *

use C<BE<lt>E<gt>> for command names like B<cat> or B<grep>.

=item *

use C<FE<lt>E<gt>> or C<CE<lt>E<gt>> for file names. C<FE<lt>E<gt>> should be the only Pod code for file names, but as most Pod formatters render it as italic, Unix and Windows paths with their slashes and backslashes may be less readable, and better rendered with C<CE<lt>E<gt>>.

=back

=item *

Be consistent.

=item *

Be nice.

=back

If you read this file _as_is_, just ignore the funny characters you see. It is written in the POD format (see perlpod manpage) which is specially designed to be readable as is.

=head1 NAME

perlamiga - Perl under Amiga OS

=head1 NOTE

B<Perl 5.8.0 cannot be built in AmigaOS. You can use either the maintenance release Perl 5.6.1 or the development release Perl 5.7.2 in AmigaOS. See L</"PERL 5.8.0 BROKEN IN AMIGAOS"> if you want to help fixing this problem.>

=head1 SYNOPSIS

One can read this document in the following formats:

	man perlamiga
	multiview perlamiga.guide

to list some (not all may be available simultaneously), or it may be read I<as is>: either as F<README.amiga>, or F<pod/perlamiga.pod>.
A recent version of perl for the Amiga can be found at the Geek Gadgets section of the Aminet: http://www.aminet.net/~aminet/dev/gg =cut Contents perlamiga - Perl under Amiga OS NAME SYNOPSIS DESCRIPTION - Prerequisites - Starting Perl programs under AmigaOS - Shortcomings of Perl under AmigaOS INSTALLATION Accessing documentation - Manpages - HTML - GNU info files - LaTeX docs BUILD - Build Prerequisites - Getting the perl source - Application of the patches - Making - Testing - Installing the built perl AUTHOR SEE ALSO =head1 DESCRIPTION =head2 Prerequisites for Compiling Perl on AmigaOS =over 6 =item B You need the Unix emulation for AmigaOS, whose most important part is B. For a minimum setup, get the latest versions of the following packages from the Aminet archives ( L ): ixemul-bin ixemul-env-bin pdksh-bin Note also that this is a minimum setup; you might want to add other packages of B (the I). =item B You need at the very least AmigaOS version 2.0. Recommended is version 3.1. =back =head2 Starting Perl programs under AmigaOS Start your Perl program F with arguments C the same way as on any other platform, by perl foo arg1 arg2 arg3 If you want to specify perl options C<-my_opts> to the perl itself (as opposed to your program), use perl -my_opts foo arg1 arg2 arg3 Alternately, you can try to get a replacement for the system's B command that honors the #!/usr/bin/perl syntax in scripts and set the s-Bit of your scripts. Then you can invoke your scripts like under UNIX with foo arg1 arg2 arg3 (Note that having *nixish full path to perl F is not necessary, F would be enough, but having full path would make it easier to use your script under *nix.) 
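As a small illustration of the invocations above, here is a hypothetical script (call it F<foo>) that reports the arguments it was started with. Nothing in it is specific to AmigaOS, the helper name C<describe_args> is invented for this sketch, and the C<#!> line only matters if your shell honors it as discussed above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Arguments given after the script name arrive in @ARGV;
# options given before it (such as -w) are consumed by perl itself.
sub describe_args {
    my @args = @_;
    return "no arguments" unless @args;
    return scalar(@args) . " argument(s): " . join(", ", @args);
}

print describe_args(@ARGV), "\n";
```

Started as C<perl foo arg1 arg2 arg3>, the script sees three elements in C<@ARGV>; started as C<perl -w foo arg1>, it sees only one, because C<-w> belongs to perl.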
=head2 Shortcomings of Perl under AmigaOS Perl under AmigaOS lacks some features of perl under UNIX because of deficiencies in the UNIX-emulation, most notably: =over 6 =item * fork() =item * some features of the UNIX filesystem regarding link count and file dates =item * inplace operation (the -i switch) without backup file =item * umask() works, but the correct permissions are only set when the file is finally close()d =back =head1 INSTALLATION Change to the installation directory (most probably ADE:), and extract the binary distribution: lha -mraxe x perl-$VERSION-bin.lha or tar xvzpf perl-$VERSION-bin.tgz (Of course you need lha or tar and gunzip for this.) For installation of the Unix emulation, read the appropriate docs. =head1 Accessing documentation =head2 Manpages for Perl on AmigaOS If you have C installed on your system, and you installed perl manpages, use something like this: man perlfunc man less man ExtUtils.MakeMaker to access documentation for different components of Perl. Start with man perl Note: You have to modify your man.conf file to search for manpages in the /ade/lib/perl5/man/man3 directory, or the man pages for the perl library will not be found. Note that dot (F<.>) is used as a package separator for documentation for packages, and as usual, sometimes you need to give the section - C<3> above - to avoid shadowing by the I. =head2 Perl HTML Documentation on AmigaOS If you have some WWW browser available, you can build B docs. Cd to directory with F<.pod> files, and do like this cd /ade/lib/perl5/pod pod2html After this you can direct your browser the file F in this directory, and go ahead with reading docs. Alternatively you may be able to get these docs prebuilt from C. =head2 Perl GNU Info Files on AmigaOS Users of C would appreciate it very much, especially with C mode loaded. You need to get latest C from C, or, alternately, prebuilt info pages. =head2 Perl LaTeX Documentation on AmigaOS Can be constructed using C. 
=head1 BUILDING PERL ON AMIGAOS

Here we discuss how to build Perl under AmigaOS.

=head2 Build Prerequisites for Perl on AmigaOS

You need to have the latest B<ixemul> (Unix emulation for Amiga) from Aminet.

=head2 Getting the Perl Source for AmigaOS

You can either get the latest perl-for-amiga source from Ninemoons and extract it with:

  tar xvzpf perl-$VERSION-src.tgz

or get the official source from CPAN:

  http://www.cpan.org/src/5.0

Extract it like this

  tar xvzpf perl-$VERSION.tar.gz

You will see a message about errors while extracting F<Configure>. This is normal and expected. (There is a conflict with a similarly-named file F<configure>, but it causes no harm.)

=head2 Making Perl on AmigaOS

Remember to use a hefty wad of stack (I use 2000000)

  sh configure.gnu --prefix=/gg

Now type

  make depend

Now!

  make

=head2 Testing Perl on AmigaOS

Now run

  make test

Some tests will be skipped because they need the fork() function:

F<io/pipe.t>, F<op/fork.t>, F<lib/filehand.t>, F<lib/open2.t>, F<lib/open3.t>, F<lib/io_pipe.t>, F<lib/io_sock.t>

=head2 Installing the built Perl on AmigaOS

Run

  make install

=head1 PERL 5.8.0 BROKEN IN AMIGAOS

As told above, Perl 5.6.1 was still good in AmigaOS, as was 5.7.2. After Perl 5.7.2 (change #11423, see the Changes file, and the file pod/perlhack.pod for how to get the individual changes) Perl dropped its internal support for vfork(), and that was very probably the step that broke AmigaOS (since the ixemul library has only vfork). The build finally fails when the ext/DynaLoader is being built, and PERL ends up as "0" in the produced Makefile; trying to run "0" does not quite work. Also, executing miniperl in backticks seems to generate nothing: very probably related to the (v)fork problems. B<Fixing the breakage requires someone quite familiar with the ixemul library, and how one is supposed to run external commands in AmigaOS without fork().>

=head1 AUTHORS

Norbert Pueschel, pueschel@imsdd.meb.uni-bonn.de
Jan-Erik Karlsson, trg@privat.utfors.se

=head1 SEE ALSO

perl(1).

=cut

=head1 NAME

perl581delta - what is new for perl v5.8.1

=head1 DESCRIPTION

This document describes differences between the 5.8.0 release and the 5.8.1 release.
If you are upgrading from an earlier release such as 5.6.1, first read the L<perl58delta>, which describes differences between 5.6.0 and 5.8.0.

In case you are wondering about 5.6.1, it was bug-fix-wise rather identical to the development release 5.7.1. Confused? This timeline hopefully helps a bit: it lists the new major releases, their maintenance releases, and the development releases.

          New     Maintenance  Development

          5.6.0                             2000-Mar-22
                               5.7.0        2000-Sep-02
                  5.6.1                     2001-Apr-08
                               5.7.1        2001-Apr-09
                               5.7.2        2001-Jul-13
                               5.7.3        2002-Mar-05
          5.8.0                             2002-Jul-18
                  5.8.1                     2003-Sep-25

=head1 Incompatible Changes

=head2 Hash Randomisation

Mainly due to security reasons, the "random ordering" of hashes has been made even more random. Previously while the order of hash elements from keys(), values(), and each() was essentially random, it was still repeatable. Now, however, the order varies between different runs of Perl.

B<Perl has never guaranteed any ordering of the hash keys>, and the ordering has already changed several times during the lifetime of Perl 5. Also, the ordering of hash keys has always been, and continues to be, affected by the insertion order.

The added randomness may affect applications.

One possible scenario is when output of an application has included hash data. For example, if you have used the Data::Dumper module to dump data into different files, and then compared the files to see whether the data has changed, now you will have false positives since the order in which hashes are dumped will vary. In general the cure is to sort the keys (or the values); in particular for Data::Dumper to use the C<Sortkeys> option. If some particular order is really important, use tied hashes: for example the Tie::IxHash module which by default preserves the order in which the hash elements were added.

A more subtle problem is reliance on the order of "global destruction". That is what happens at the end of execution: Perl destroys all data structures, including user data.
If your destructors (the DESTROY subroutines) have assumed any particular ordering to the global destruction, there might be problems ahead. For example, in a destructor of one object you cannot assume that objects of any other class are still available, unless you hold a reference to them. If the environment variable PERL_DESTRUCT_LEVEL is set to a non-zero value, or if Perl is exiting a spawned thread, it will also destruct the ordinary references and the symbol tables that are no longer in use. You can't call a class method or an ordinary function on a class that has been collected that way. The hash randomisation is certain to reveal hidden assumptions about some particular ordering of hash elements, and outright bugs: it revealed a few bugs in the Perl core and core modules. To disable the hash randomisation in runtime, set the environment variable PERL_HASH_SEED to 0 (zero) before running Perl (for more information see L), or to disable the feature completely in compile time, compile with C<-DNO_HASH_SEED> (see F). See L for the original rationale behind this change. =head2 UTF-8 On Filehandles No Longer Activated By Locale In Perl 5.8.0 all filehandles, including the standard filehandles, were implicitly set to be in Unicode UTF-8 if the locale settings indicated the use of UTF-8. This feature caused too many problems, so the feature was turned off and redesigned: see L. =head2 Single-number v-strings are no longer v-strings before "=>" The version strings or v-strings (see L) feature introduced in Perl 5.6.0 has been a source of some confusion-- especially when the user did not want to use it, but Perl thought it knew better. Especially troublesome has been the feature that before a "=>" a version string (a "v" followed by digits) has been interpreted as a v-string instead of a string literal. 
In other words: %h = ( v65 => 42 ); has meant since Perl 5.6.0 %h = ( 'A' => 42 ); (at least in platforms of ASCII progeny) Perl 5.8.1 restores the more natural interpretation %h = ( 'v65' => 42 ); The multi-number v-strings like v65.66 and 65.66.67 still continue to be v-strings in Perl 5.8. =head2 (Win32) The -C Switch Has Been Repurposed The -C switch has changed in an incompatible way. The old semantics of this switch only made sense in Win32 and only in the "use utf8" universe in 5.6.x releases, and do not make sense for the Unicode implementation in 5.8.0. Since this switch could not have been used by anyone, it has been repurposed. The behavior that this switch enabled in 5.6.x releases may be supported in a transparent, data-dependent fashion in a future release. For the new life of this switch, see L<"UTF-8 no longer default under UTF-8 locales">, and L. =head2 (Win32) The /d Switch Of cmd.exe Perl 5.8.1 uses the /d switch when running the cmd.exe shell internally for system(), backticks, and when opening pipes to external programs. The extra switch disables the execution of AutoRun commands from the registry, which is generally considered undesirable when running external programs. If you wish to retain compatibility with the older behavior, set PERL5SHELL in your environment to C. =head1 Core Enhancements =head2 UTF-8 no longer default under UTF-8 locales In Perl 5.8.0 many Unicode features were introduced. One of them was found to be of more nuisance than benefit: the automagic (and silent) "UTF-8-ification" of filehandles, including the standard filehandles, if the user's locale settings indicated use of UTF-8. For example, if you had C as your locale, your STDIN and STDOUT were automatically "UTF-8", in other words an implicit binmode(..., ":utf8") was made. This meant that trying to print, say, chr(0xff), ended up printing the bytes 0xc3 0xbf. Hardly what you had in mind unless you were aware of this feature of Perl 5.8.0. 
The problem is that the vast majority of people weren't: for example in RedHat releases 8 and 9 the B locale setting is UTF-8, so all RedHat users got UTF-8 filehandles, whether they wanted it or not. The pain was intensified by the Unicode implementation of Perl 5.8.0 (still) having nasty bugs, especially related to the use of s/// and tr///. (Bugs that have been fixed in 5.8.1) Therefore a decision was made to backtrack the feature and change it from implicit silent default to explicit conscious option. The new Perl command line option C<-C> and its counterpart environment variable PERL_UNICODE can now be used to control how Perl and Unicode interact at interfaces like I/O and for example the command line arguments. See L and L for more information. =head2 Unsafe signals again available In Perl 5.8.0 the so-called "safe signals" were introduced. This means that Perl no longer handles signals immediately but instead "between opcodes", when it is safe to do so. The earlier immediate handling easily could corrupt the internal state of Perl, resulting in mysterious crashes. However, the new safer model has its problems too. Because now an opcode, a basic unit of Perl execution, is never interrupted but instead let to run to completion, certain operations that can take a long time now really do take a long time. For example, certain network operations have their own blocking and timeout mechanisms, and being able to interrupt them immediately would be nice. Therefore perl 5.8.1 introduces a "backdoor" to restore the pre-5.8.0 (pre-5.7.3, really) signal behaviour. Just set the environment variable PERL_SIGNALS to C, and the old immediate (and unsafe) signal handling behaviour returns. See L and L. In completely unrelated news, you can now use safe signals with POSIX::SigAction. See L. =head2 Tied Arrays with Negative Array Indices Formerly, the indices passed to C, C, C, and C methods in tied array class were always non-negative. 
If the actual argument was negative, Perl would call FETCHSIZE implicitly and add the result to the index before passing the result to the tied array method. This behaviour is now optional. If the tied array class contains a package variable named C<$NEGATIVE_INDICES> which is set to a true value, negative values will be passed to C, C, C, and C unchanged. =head2 local ${$x} The syntaxes local ${$x} local @{$x} local %{$x} now do localise variables, given that the $x is a valid variable name. =head2 Unicode Character Database 4.0.0 The copy of the Unicode Character Database included in Perl 5.8 has been updated to 4.0.0 from 3.2.0. This means for example that the Unicode character properties are as in Unicode 4.0.0. =head2 Deprecation Warnings There is one new feature deprecation. Perl 5.8.0 forgot to add some deprecation warnings, these warnings have now been added. Finally, a reminder of an impending feature removal. =head3 (Reminder) Pseudo-hashes are deprecated (really) Pseudo-hashes were deprecated in Perl 5.8.0 and will be removed in Perl 5.10.0, see L for details. Each attempt to access pseudo-hashes will trigger the warning C. If you really want to continue using pseudo-hashes but not to see the deprecation warnings, use: no warnings 'deprecated'; Or you can continue to use the L pragma, but please don't expect the data structures to be pseudohashes any more. =head3 (Reminder) 5.005-style threads are deprecated (really) 5.005-style threads (activated by C) were deprecated in Perl 5.8.0 and will be removed after Perl 5.8, see L for details. Each 5.005-style thread creation will trigger the warning C<5.005 threads are deprecated>. If you really want to continue using the 5.005 threads but not to see the deprecation warnings, use: no warnings 'deprecated'; =head3 (Reminder) The $* variable is deprecated (really) The C<$*> variable controlling multi-line matching has been deprecated and will be removed after 5.8. 
The variable has been deprecated for a long time, and a deprecation warning C is given, now the variable will just finally be removed. The functionality has been supplanted by the C and C modifiers on pattern matching. If you really want to continue using the C<$*>-variable but not to see the deprecation warnings, use: no warnings 'deprecated'; =head2 Miscellaneous Enhancements C in void context is no longer expensive. C is now context aware, and will not construct a list if called in void context. If a socket gets closed by the server while printing to it, the client now gets a SIGPIPE. While this new feature was not planned, it fell naturally out of PerlIO changes, and is to be considered an accidental feature. PerlIO::get_layers(FH) returns the names of the PerlIO layers active on a filehandle. PerlIO::via layers can now have an optional UTF8 method to indicate whether the layer wants to "auto-:utf8" the stream. utf8::is_utf8() has been added as a quick way to test whether a scalar is encoded internally in UTF-8 (Unicode). =head1 Modules and Pragmata =head2 Updated Modules And Pragmata The following modules and pragmata have been updated since Perl 5.8.0: =over 4 =item base =item B::Bytecode In much better shape than it used to be. Still far from perfect, but maybe worth a try. =item B::Concise =item B::Deparse =item Benchmark An optional feature, C<:hireswallclock>, now allows for high resolution wall clock times (uses Time::HiRes). =item ByteLoader See B::Bytecode. =item bytes Now has bytes::substr. =item CGI =item charnames One can now have custom character name aliases. =item CPAN There is now a simple command line frontend to the CPAN.pm module called F. =item Data::Dumper A new option, Pair, allows choosing the separator between hash keys and values. =item DB_File =item Devel::PPPort =item Digest::MD5 =item Encode Significant updates on the encoding pragma functionality (tr/// and the DATA filehandle, formats). 
If a filehandle has been marked as to have an encoding, unmappable characters are detected already during input, not later (when the corrupted data is being used). The ISO 8859-6 conversion table has been corrected (the 0x30..0x39 erroneously mapped to U+0660..U+0669, instead of U+0030..U+0039). The GSM 03.38 conversion did not handle escape sequences correctly. The UTF-7 encoding has been added (making Encode feature-complete with Unicode::String). =item fields =item libnet =item Math::BigInt A lot of bugs have been fixed since v1.60, the version included in Perl v5.8.0. Especially noteworthy are the bug in Calc that caused div and mod to fail for some large values, and the fixes to the handling of bad inputs. Some new features were added, e.g. the broot() method, you can now pass parameters to config() to change some settings at runtime, and it is now possible to trap the creation of NaN and infinity. As usual, some optimizations took place and made the math overall a tad faster. In some cases, quite a lot faster, actually. Especially alternative libraries like Math::BigInt::GMP benefit from this. In addition, a lot of the quite clunky routines like fsqrt() and flog() are now much much faster. =item MIME::Base64 =item NEXT Diamond inheritance now works. =item Net::Ping =item PerlIO::scalar Reading from non-string scalars (like the special variables, see L) now works. =item podlators =item Pod::LaTeX =item PodParsers =item Pod::Perldoc Complete rewrite. As a side-effect, no longer refuses to startup when run by root. =item Scalar::Util New utilities: refaddr, isvstring, looks_like_number, set_prototype. =item Storable Can now store code references (via B::Deparse, so not foolproof). =item strict Earlier versions of the strict pragma did not check the parameters implicitly passed to its "import" (use) and "unimport" (no) routine. 
This caused the false idiom such as: use strict qw(@ISA); @ISA = qw(Foo); This however (probably) raised the false expectation that the strict refs, vars and subs were being enforced (and that @ISA was somehow "declared"). But the strict refs, vars, and subs are B enforced when using this false idiom. Starting from Perl 5.8.1, the above B cause an error to be raised. This may cause programs which used to execute seemingly correctly without warnings and errors to fail when run under 5.8.1. This happens because use strict qw(@ISA); will now fail with the error: Unknown 'strict' tag(s) '@ISA' The remedy to this problem is to replace this code with the correct idiom: use strict; use vars qw(@ISA); @ISA = qw(Foo); =item Term::ANSIcolor =item Test::Harness Now much more picky about extra or missing output from test scripts. =item Test::More =item Test::Simple =item Text::Balanced =item Time::HiRes Use of nanosleep(), if available, allows mixing subsecond sleeps with alarms. =item threads Several fixes, for example for join() problems and memory leaks. In some platforms (like Linux) that use glibc the minimum memory footprint of one ithread has been reduced by several hundred kilobytes. =item threads::shared Many memory leaks have been fixed. =item Unicode::Collate =item Unicode::Normalize =item Win32::GetFolderPath =item Win32::GetOSVersion Now returns extra information. =back =head1 Utility Changes The C utility now produces a more modern layout: F instead of F. Also, the boilerplate test is now called F instead of F. The Perl debugger (F) has now been extensively documented and bugs found while documenting have been fixed. C has been rewritten from scratch to be more robust and feature rich. C works now at least somewhat better, while C is rather more broken. (The Perl compiler suite as a whole continues to be experimental.) 
=head1 New Documentation perl573delta has been added to list the differences between the (now quite obsolete) development releases 5.7.2 and 5.7.3. perl58delta has been added: it is the perldelta of 5.8.0, detailing the differences between 5.6.0 and 5.8.0. perlartistic has been added: it is the Artistic License in pod format, making it easier for modules to refer to it. perlcheat has been added: it is a Perl cheat sheet. perlgpl has been added: it is the GNU General Public License in pod format, making it easier for modules to refer to it. perlmacosx has been added to tell about the installation and use of Perl in Mac OS X. perlos400 has been added to tell about the installation and use of Perl in OS/400 PASE. perlreref has been added: it is a regular expressions quick reference. =head1 Installation and Configuration Improvements The Unix standard Perl location, F, is no longer overwritten by default if it exists. This change was very prudent because so many Unix vendors already provide a F, but simultaneously many system utilities may depend on that exact version of Perl, so better not to overwrite it. One can now specify installation directories for site and vendor man and HTML pages, and site and vendor scripts. See F. One can now specify a destination directory for Perl installation by specifying the DESTDIR variable for C. (This feature is slightly different from the previous C.) See F. gcc versions 3.x introduced a new warning that caused a lot of noise during Perl compilation: C. This warning has now been avoided by Configure weeding out such directories before the compilation. One can now build subsets of Perl core modules by using the Configure flags C<-Dnoextensions=...> and C<-Donlyextensions=...>, see F. =head2 Platform-specific enhancements In Cygwin Perl can now be built with threads (C). This works with both Cygwin 1.3.22 and Cygwin 1.5.3. 
In newer FreeBSD releases Perl 5.8.0 compilation failed because of trying to use F, which in FreeBSD is just a dummy file, and a fatal error to even try to use. Now F is not used. Perl is now known to build also in Hitachi HI-UXMPP. Perl is now known to build again in LynxOS. Mac OS X now installs with Perl version number embedded in installation directory names for easier upgrading of user-compiled Perl, and the installation directories in general are more standard. In other words, the default installation no longer breaks the Apple-provided Perl. On the other hand, with C you can now really replace the Apple-supplied Perl (B). Mac OS X now builds Perl statically by default. This change was done mainly for faster startup times. The Apple-provided Perl is still dynamically linked and shared, and you can enable the sharedness for your own Perl builds by C. Perl has been ported to IBM's OS/400 PASE environment. The best way to build a Perl for PASE is to use an AIX host as a cross-compilation environment. See README.os400. Yet another cross-compilation option has been added: now Perl builds on OpenZaurus, a Linux distribution based on Mandrake + Embedix for the Sharp Zaurus PDA. See the Cross/README file. Tru64 when using gcc 3 drops the optimisation for F to C<-O2> because of gigantic memory use with the default C<-O3>. Tru64 can now build Perl with the newer Berkeley DBs. Building Perl on WinCE has been much enhanced, see F and F. =head1 Selected Bug Fixes =head2 Closures, eval and lexicals There have been many fixes in the area of anonymous subs, lexicals and closures. Although this means that Perl is now more "correct", it is possible that some existing code will break that happens to rely on the faulty behaviour. In practice this is unlikely unless your code contains a very complex nesting of anonymous subs, evals and lexicals.
=head2 Generic fixes If an input filehandle is marked C<:utf8> and Perl sees illegal UTF-8 coming in when doing C<< >>, if warnings are enabled a warning is immediately given - instead of being silent about it and Perl being unhappy about the broken data later. (The C<:encoding(utf8)> layer also works the same way.) binmode(SOCKET, ":utf8") only worked on the input side, not on the output side of the socket. Now it works both ways. For threaded Perls certain system database functions like getpwent() and getgrent() now grow their result buffer dynamically, instead of failing. This means that at sites with lots of users and groups the functions no longer fail by returning only partial results. Perl 5.8.0 had accidentally broken the capability for users to define their own uppercase<->lowercase Unicode mappings (as advertised by the Camel). This feature has been fixed and is also documented better. In 5.8.0 this $some_unicode .= ; didn't work correctly but instead corrupted the data. This has now been fixed. Tied methods like FETCH etc. may now safely access tied values, i.e. resulting in a recursive call to FETCH etc. Remember to break the recursion, though. At startup Perl blocks the SIGFPE signal away since there isn't much Perl can do about it. Previously this blocking was in effect also for programs executed from within Perl. Now Perl restores the original SIGFPE handling routine, whatever it was, before running external programs. Linenumbers in Perl scripts may now be greater than 65536, or 2**16. (Perl scripts have always been able to be larger than that, it's just that the linenumber for reported errors and warnings have "wrapped around".) While scripts that large usually indicate a need to rethink your code a bit, such Perl scripts do exist, for example as results from generated code. Now linenumbers can go all the way to 4294967296, or 2**32. 
=head2 Platform-specific fixes Linux =over 4 =item * Setting $0 works again (with certain limitations that Perl cannot do much about: see L) =back HP-UX =over 4 =item * Setting $0 now works. =back VMS =over 4 =item * Configuration now tests for the presence of C, and IO::Poll now uses the vendor-supplied function if detected. =item * A rare access violation at Perl start-up could occur if the Perl image was installed with privileges or if there was an identifier with the subsystem attribute set in the process's rightslist. Either of these circumstances triggered tainting code that contained a pointer bug. The faulty pointer arithmetic has been fixed. =item * The length limit on values (not keys) in the %ENV hash has been raised from 255 bytes to 32640 bytes (except when the PERL_ENV_TABLES setting overrides the default use of logical names for %ENV). If it is necessary to access these long values from outside Perl, be aware that they are implemented using search list logical names that store the value in pieces, each 255-byte piece (up to 128 of them) being an element in the search list. When doing a lookup in %ENV from within Perl, the elements are combined into a single value. The existing VMS-specific ability to access individual elements of a search list logical name via the $ENV{'foo;N'} syntax (where N is the search list index) is unimpaired. =item * The piping implementation now uses local rather than global DCL symbols for inter-process communication. =item * File::Find could become confused when navigating to a relative directory whose name collided with a logical name. This problem has been corrected by adding directory syntax to relative path names, thus preventing logical name translation. =back Win32 =over 4 =item * A memory leak in the fork() emulation has been fixed. =item * The return value of the ioctl() built-in function was accidentally broken in 5.8.0. This has been corrected. 
=item * The internal message loop executed by perl during blocking operations sometimes interfered with messages that were external to Perl. This often resulted in blocking operations terminating prematurely or returning incorrect results, when Perl was executing under environments that could generate Windows messages. This has been corrected. =item * Pipes and sockets are now automatically in binary mode. =item * The four-argument form of select() did not preserve $! (errno) properly when there were errors in the underlying call. This is now fixed. =item * The "CR CR LF" problem of has been fixed, binmode(FH, ":crlf") is now effectively a no-op. =back =head1 New or Changed Diagnostics All the warnings related to pack() and unpack() were made more informative and consistent. =head2 Changed "A thread exited while %d threads were running" The old version A thread exited while %d other threads were still running was misleading because the "other" included also the thread giving the warning. =head2 Removed "Attempt to clear a restricted hash" It is not illegal to clear a restricted hash, so the warning was removed. =head2 New "Illegal declaration of anonymous subroutine" You must specify the block of code for C. =head2 Changed "Invalid range "%s" in transliteration operator" The old version Invalid [] range "%s" in transliteration operator was simply wrong because there are no "[] ranges" in tr///. =head2 New "Missing control char name in \c" Self-explanatory. =head2 New "Newline in left-justified string for %s" The padding spaces would appear after the newline, which is probably not what you had in mind. =head2 New "Possible precedence problem on bitwise %c operator" If you think this $x & $y == 0 tests whether the bitwise AND of $x and $y is zero, you will like this warning. =head2 New "Pseudo-hashes are deprecated" This warning should have been already in 5.8.0, since they are. 
=head2 New "read() on %s filehandle %s" You cannot read() (or sysread()) from a closed or unopened filehandle. =head2 New "5.005 threads are deprecated" This warning should have been already in 5.8.0, since they are. =head2 New "Tied variable freed while still in use" Something pulled the plug on a live tied variable, Perl plays safe by bailing out. =head2 New "To%s: illegal mapping '%s'" An illegal user-defined Unicode casemapping was specified. =head2 New "Use of freed value in iteration" Something modified the values being iterated over. This is not good. =head1 Changed Internals This news matters to you only if you either write XS code or like to know about or hack Perl internals (using Devel::Peek or any of the C modules counts), or like to run Perl with the C<-D> option. The embedding examples of L have been reviewed to be up to date and consistent: for example, the correct use of PERL_SYS_INIT3() and PERL_SYS_TERM(). Extensive reworking of the pad code (the code responsible for lexical variables) has been conducted by Dave Mitchell. Extensive work on the v-strings by John Peacock. UTF-8 length and position cache: to speed up the handling of Unicode (UTF-8) scalars, a cache was introduced. Potential problems exist if an extension bypasses the official APIs and directly modifies the PV of an SV: the UTF-8 cache does not get cleared as it should. APIs obsoleted in Perl 5.8.0, like sv_2pv, sv_catpvn, sv_catsv, sv_setsv, are again available. Certain Perl core C APIs like cxinc and regatom are no longer available at all to code outside the Perl core or the Perl core extensions. This is intentional. They never should have been available with the shorter names, and if your application depends on them, you should (be ashamed and) contact perl5-porters to discuss what the proper APIs are. Certain Perl core C APIs like C are no longer available without their C prefix.
If your XS module stops working because some functions cannot be found, in many cases a simple fix is to add the C prefix to the function and the thread context C as the first argument of the function call. This is also how it should always have been done: letting the Perl_-less forms to leak from the core was an accident. For cleaner embedding you can also force this for all APIs by defining at compile time the cpp define PERL_NO_SHORT_NAMES. Perl_save_bool() has been added. Regexp objects (those created with C) now have S-magic rather than R-magic. This fixed regexps of the form /...(??{...;$x})/ to no longer ignore changes made to $x. The S-magic avoids dropping the caching optimization and making (??{...}) constructs obscenely slow (and consequently useless). See also L. Regexp::Copy was affected by this change. The Perl internal debugging macros DEBUG() and DEB() have been renamed to PERL_DEBUG() and PERL_DEB() to avoid namespace conflicts. C<-DL> removed (the leaktest had been broken and unsupported for years, use alternative debugging mallocs or tools like valgrind and Purify). Verbose modifier C added for C<-DXv> and C<-Dsv>, see L. =head1 New Tests In Perl 5.8.0 there were about 69000 separate tests in about 700 test files, in Perl 5.8.1 there are about 77000 separate tests in about 780 test files. The exact numbers depend on the Perl configuration and on the operating system platform. =head1 Known Problems The hash randomisation mentioned in L is definitely problematic: it will wake dormant bugs and shake out bad assumptions. If you want to use mod_perl 2.x with Perl 5.8.1, you will need mod_perl-1.99_10 or higher. Earlier versions of mod_perl 2.x do not work with the randomised hashes. (mod_perl 1.x works fine.) You will also need Apache::Test 1.04 or higher. 
Many of the rarer platforms that worked 100% or pretty close to it with perl 5.8.0 have been left a little bit untended since their maintainers have been otherwise busy lately, and therefore there will be more failures on those platforms. Such platforms include Mac OS Classic, IBM z/OS (and other EBCDIC platforms), and NetWare. The most common Perl platforms (Unix and Unix-like, Microsoft platforms, and VMS) have large enough testing and expert population that they are doing well. =head2 Tied hashes in scalar context Tied hashes do not currently return anything useful in scalar context, for example when used as boolean tests: if (%tied_hash) { ... } The current nonsensical behaviour is always to return false, regardless of whether the hash is empty or has elements. The root cause is that there is no interface for the implementors of tied hashes to implement the behaviour of a hash in scalar context. =head2 Net::Ping 450_service and 510_ping_udp failures The subtests 9 and 18 of lib/Net/Ping/t/450_service.t, and the subtest 2 of lib/Net/Ping/t/510_ping_udp.t might fail if you have an unusual networking setup. For example in the latter case the test is trying to send a UDP ping to the IP address 127.0.0.1. =head2 B::C The C-generating compiler backend B::C (the frontend being C) is even more broken than it used to be because of the extensive lexical variable changes. (The good news is that B::Bytecode and ByteLoader are better than they used to be.) =head1 Platform Specific Problems =head2 EBCDIC Platforms IBM z/OS and other EBCDIC platforms continue to be problematic regarding Unicode support. Many Unicode tests are skipped when they really should be fixed. =head2 Cygwin 1.5 problems In Cygwin 1.5 the F and F tests have failures for some yet unknown reason. In 1.5.5 the threads tests stress_cv, stress_re, and stress_string are failing unless the environment variable PERLIO is set to "perlio" (which makes also the io/tell failure go away). 
Perl 5.8.1 does build and work well with Cygwin 1.3: with (uname -a) C a 100% "make test" was achieved with C. =head2 HP-UX: HP cc warnings about sendfile and sendpath With certain HP C compiler releases (e.g. B.11.11.02) you will get many warnings like this (lines wrapped for easier reading): cc: "/usr/include/sys/socket.h", line 504: warning 562: Redeclaration of "sendfile" with a different storage class specifier: "sendfile" will have internal linkage. cc: "/usr/include/sys/socket.h", line 505: warning 562: Redeclaration of "sendpath" with a different storage class specifier: "sendpath" will have internal linkage. The warnings show up both during the build of Perl and during certain lib/ExtUtils tests that invoke the C compiler. The warning, however, is not serious and can be ignored. =head2 IRIX: t/uni/tr_7jis.t falsely failing The test t/uni/tr_7jis.t is known to report failure under 'make test' or the test harness with certain releases of IRIX (at least IRIX 6.5 and MIPSpro Compilers Version 7.3.1.1m), but if run manually the test fully passes. =head2 Mac OS X: no usemymalloc The Perl malloc (C<-Dusemymalloc>) does not work at all in Mac OS X. This is not that serious, though, since the native malloc works just fine. =head2 Tru64: No threaded builds with GNU cc (gcc) In the latest Tru64 releases (e.g. v5.1B or later) gcc cannot be used to compile a threaded Perl (-Duseithreads) because the system C<< >> file doesn't know about gcc. =head2 Win32: sysopen, sysread, syswrite As of the 5.8.0 release, sysopen()/sysread()/syswrite() do not behave like they used to in 5.6.1 and earlier with respect to "text" mode. These built-ins now always operate in "binary" mode (even if sysopen() was passed the O_TEXT flag, or if binmode() was used on the file handle). Note that this issue should only make a difference for disk files, as sockets and pipes have always been in "binary" mode in the Windows port. 
As this behavior is currently considered a bug, compatible behavior may be re-introduced in a future release. Until then, the use of sysopen(), sysread() and syswrite() is not supported for "text" mode operations. =head1 Future Directions The following things B happen in future. The first publicly available releases having these characteristics will be the developer releases Perl 5.9.x, culminating in the Perl 5.10.0 release. These are our best guesses at the moment: we reserve the right to rethink. =over 4 =item * PerlIO will become The Default. Currently (in Perl 5.8.x) the stdio library is still used if Perl thinks it can use certain tricks to make stdio go B fast. For future releases our goal is to make PerlIO go even faster. =item * A new feature called I will be available. This means that one can have code called assertions sprinkled in the code: usually they are optimised away, but they can be enabled with the C<-A> option. =item * A new operator C (defined-or) will be available. This means that one will be able to say $a // $b instead of defined $a ? $a : $b and $c //= $d; instead of $c = $d unless defined $c; The operator will have the same precedence and associativity as C<||>. A source code patch against the Perl 5.8.1 sources will be available in CPAN as F. =item * C will default to unpacking the C<$_>. =item * Various Copy-On-Write techniques will be investigated in hopes of speeding up Perl. =item * CPANPLUS, Inline, and Module::Build will become core modules. =item * The ability to write true lexically scoped pragmas will be introduced. =item * Work will continue on the bytecompiler and byteloader. =item * v-strings as they currently exist are scheduled to be deprecated. The v-less form (1.2.3) will become a "version object" when used with C, C, and C<$VERSION>. $^V will also be a "version object" so the printf("%vd",...) construct will no longer be needed. The v-ful version (v1.2.3) will become obsolete. 
The equivalence of strings and v-strings (e.g. that currently 5.8.0 is equal to "\5\8\0") will go away. B, though: it is quite hard to detect when v-strings are being used safely, and when they are not. =item * 5.005 Threads Will Be Removed =item * The C<$*> Variable Will Be Removed (it was deprecated a long time ago) =item * Pseudohashes Will Be Removed =back =head1 Reporting Bugs If you find what you think is a bug, you might check the articles recently posted to the comp.lang.perl.misc newsgroup and the perl bug database at http://bugs.perl.org/ . There may also be information at http://www.perl.com/ , the Perl Home Page. If you believe you have an unreported bug, please run the B program included with your release. Be sure to trim your bug down to a tiny but sufficient test case. Your bug report, along with the output of C, will be sent off to perlbug@perl.org to be analysed by the Perl porting team. You can browse and search the Perl 5 bugs at http://bugs.perl.org/ =head1 SEE ALSO The F file for exhaustive details on what changed. The F file for how to build Perl. The F file for general stuff. The F and F files for copyright information. =cut =encoding utf8 =for comment Consistent formatting of this file is achieved with: perl ./Porting/podtidy pod/perlgit.pod =head1 NAME perlgit - Detailed information about git and the Perl repository =head1 DESCRIPTION This document provides details on using git to develop Perl. If you are just interested in working on a quick patch, see L first. This document is intended for people who are regular contributors to Perl, including those with write access to the git repository. =head1 CLONING THE REPOSITORY All of Perl's source code is kept centrally in a Git repository at I. You can make a read-only clone of the repository by running: % git clone git://perl5.git.perl.org/perl.git perl This uses the git protocol (port 9418).
If you cannot use the git protocol for firewall reasons, you can also clone via http, though this is much slower: % git clone http://perl5.git.perl.org/perl.git perl =head1 WORKING WITH THE REPOSITORY Once you have changed into the repository directory, you can inspect it. After a clone the repository will contain a single local branch, which will be the current branch as well, as indicated by the asterisk. % git branch * blead Using the -a switch to C will also show the remote tracking branches in the repository: % git branch -a * blead origin/HEAD origin/blead ... The branches that begin with "origin" correspond to the "git remote" that you cloned from (which is named "origin"). Each branch on the remote will be exactly tracked by these branches. You should NEVER do work on these remote tracking branches. You only ever do work in a local branch. Local branches can be configured to automerge (on pull) from a designated remote tracking branch. This is the case with the default branch C which will be configured to merge from the remote tracking branch C. You can see recent commits: % git log And pull new changes from the repository, and update your local repository (must be clean first) % git pull Assuming we are on the branch C immediately after a pull, this command would be more or less equivalent to: % git fetch % git merge origin/blead In fact if you want to update your local repository without touching your working directory you do: % git fetch And if you want to update your remote-tracking branches for all defined remotes simultaneously you can do % git remote update Neither of these last two commands will update your working directory, however both will update the remote-tracking branches in your repository. 
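The difference between fetching and merging can be tried out safely in a throwaway sandbox. The following is only a sketch under stated assumptions: it builds a tiny stand-in repository in a temporary directory rather than touching the real Perl repository, and the names C<upstream>, C<work>, C<README> and the C<commit> helper are hypothetical, chosen for illustration. It shows that C<git fetch> advances C<origin/blead> while leaving the local C<blead> branch and the working directory alone, and that a later merge brings the local branch up to date.

```shell
set -e
sandbox=$(mktemp -d)
cd "$sandbox"

# Hypothetical helper: commit with a fixed identity so that no global
# git configuration is required inside the sandbox.
commit() {
    git -c user.name=Example -c user.email=example@example.invalid commit -q "$@"
}

# Stand-in for the upstream repository, with 'blead' as its default branch.
git init -q upstream
cd upstream
git symbolic-ref HEAD refs/heads/blead
echo one > README
git add README
commit -m 'first commit'
cd ..

# Clone it; the clone gets a local 'blead' tracking 'origin/blead'.
git clone -q upstream work

# Someone adds a commit upstream after we cloned.
cd upstream
echo two >> README
commit -a -m 'second commit'
cd ../work

before=$(git rev-parse blead)
git fetch -q                          # updates remote-tracking branches only
after_local=$(git rev-parse blead)
after_remote=$(git rev-parse origin/blead)
worktree=$(cat README)                # working directory is still untouched

git merge -q origin/blead             # the second half of what 'git pull' does
merged=$(git rev-parse blead)

echo "local blead before and after fetch: $before $after_local"
echo "origin/blead after fetch:           $after_remote"
echo "local blead after merge:            $merged"
```

Running the two steps separately like this is a convenient way to inspect what arrived (for example with C<git log blead..origin/blead>) before merging it.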
To make a local branch of a remote branch: % git checkout -b maint-5.10 origin/maint-5.10 To switch back to blead: % git checkout blead =head2 Finding out your status The most common git command you will use will probably be % git status This command will produce as output a description of the current state of the repository, including modified files and unignored untracked files, and in addition it will show things like what files have been staged for the next commit, and usually some useful information about how to change things. For instance the following: $ git status # On branch blead # Your branch is ahead of 'origin/blead' by 1 commit. # # Changes to be committed: # (use "git reset HEAD ..." to unstage) # # modified: pod/perlgit.pod # # Changed but not updated: # (use "git add ..." to update what will be committed) # # modified: pod/perlgit.pod # # Untracked files: # (use "git add ..." to include in what will be committed) # # deliberate.untracked This shows that there were changes to this document staged for commit, and that there were further changes in the working directory not yet staged. It also shows that there was an untracked file in the working directory, and as you can see shows how to change all of this. It also shows that there is one commit on the working branch C which has not been pushed to the C remote yet. B: that this output is also what you see as a template if you do not provide a message to C. =head2 Patch workflow First, please read L for details on hacking the Perl core. That document covers many details on how to create a good patch. If you already have a Perl repository, you should ensure that you're on the I branch, and your repository is up to date: % git checkout blead % git pull It's preferable to patch against the latest blead version, since this is where new development occurs for all changes other than critical bug fixes. 
Critical bug fix patches should be made against the relevant maint branches, or should be submitted with a note indicating all the branches where the fix should be applied. Now that we have everything up to date, we need to create a temporary new branch for these changes and switch into it: % git checkout -b orange which is the short form of % git branch orange % git checkout orange Creating a topic branch makes it easier for the maintainers to rebase or merge back into the master blead for a more linear history. If you don't work on a topic branch the maintainer has to manually cherry pick your changes onto blead before they can be applied. That'll get you scolded on perl5-porters, so don't do that. Be Awesome. Then make your changes. For example, if Leon Brocard changes his name to Orange Brocard, we should change his name in the AUTHORS file: % perl -pi -e 's{Leon Brocard}{Orange Brocard}' AUTHORS You can see what files are changed: % git status # On branch orange # Changes to be committed: # (use "git reset HEAD ..." to unstage) # # modified: AUTHORS # And you can see the changes: % git diff diff --git a/AUTHORS b/AUTHORS index 293dd70..722c93e 100644 --- a/AUTHORS +++ b/AUTHORS @@ -541,7 +541,7 @@ Lars Hecking Laszlo Molnar Leif Huhn Len Johnson -Leon Brocard +Orange Brocard Les Peters Lesley Binks Lincoln D. Stein Now commit your change locally: % git commit -a -m 'Rename Leon Brocard to Orange Brocard' Created commit 6196c1d: Rename Leon Brocard to Orange Brocard 1 files changed, 1 insertions(+), 1 deletions(-) The C<-a> option is used to include all files that git tracks that you have changed. If at this time, you only want to commit some of the files you have worked on, you can omit the C<-a> and use the command C>> before doing the commit. C> allows you to even just commit portions of files instead of all the changes in them. The C<-m> option is used to specify the commit message. 
If you omit it, git will open a text editor for you to compose the message interactively. This is useful when the changes are more complex than the sample given here, and, depending on the editor, to know that the first line of the commit message doesn't exceed the 50 character legal maximum. Once you've finished writing your commit message and exited your editor, git will write your change to disk and tell you something like this: Created commit daf8e63: explain git status and stuff about remotes 1 files changed, 83 insertions(+), 3 deletions(-) If you re-run C, you should see something like this: % git status # On branch blead # Your branch is ahead of 'origin/blead' by 2 commits. # # Untracked files: # (use "git add ..." to include in what will be committed) # # deliberate.untracked nothing added to commit but untracked files present (use "git add" to track) When in doubt, before you do anything else, check your status and read it carefully, many questions are answered directly by the git status output. You can examine your last commit with: % git show HEAD and if you are not happy with either the description or the patch itself you can fix it up by editing the files once more and then issue: % git commit -a --amend Now you should create a patch file for all your local changes: % git format-patch -M origin.. 0001-Rename-Leon-Brocard-to-Orange-Brocard.patch You should now send an email to L with a description of your changes, and include this patch file as an attachment. In addition to being tracked by RT, mail to perlbug will automatically be forwarded to perl5-porters (with manual moderation, so please be patient). You should only send patches to L directly if the patch is not ready to be applied, but intended for discussion. See the next section for how to configure and use git to send these emails for you. 
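The patch workflow described so far can be rehearsed end to end in a sandbox before trying it on the real repository. This is a minimal sketch, not the porters' official procedure: the C<upstream> and C<work> directories, the three-line C<AUTHORS> file and the C<commit> helper are hypothetical stand-ins, and the range is written explicitly as C<origin/blead..> so the example does not depend on how C<origin> is resolved in a particular clone.

```shell
set -e
sandbox=$(mktemp -d)
cd "$sandbox"

# Hypothetical helper: commit with a fixed identity so that no global
# git configuration is required inside the sandbox.
commit() {
    git -c user.name=Example -c user.email=example@example.invalid commit -q "$@"
}

# Stand-in upstream repository with a 'blead' branch and a small AUTHORS file.
git init -q upstream
cd upstream
git symbolic-ref HEAD refs/heads/blead
printf 'Leif Huhn\nLeon Brocard\nLes Peters\n' > AUTHORS
git add AUTHORS
commit -m 'Add AUTHORS'
cd ..

git clone -q upstream work
cd work

# Create the topic branch and make the change on it.
git checkout -q -b orange
sed 's/Leon Brocard/Orange Brocard/' AUTHORS > AUTHORS.new
mv AUTHORS.new AUTHORS
commit -a -m 'Rename Leon Brocard to Orange Brocard'

# One patch file is written per commit that is not yet in origin/blead;
# format-patch prints the name of each file it creates.
patch=$(git format-patch -M origin/blead..)
echo "$patch"
```

The resulting file contains the commit message and the diff in mail format, which is what gets attached to the report.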
If you want to delete your temporary branch, you may do so with: % git checkout blead % git branch -d orange error: The branch 'orange' is not an ancestor of your current HEAD. If you are sure you want to delete it, run 'git branch -D orange'. % git branch -D orange Deleted branch orange. =head2 Committing your changes Assuming that you'd like to commit all the changes you've made as a single atomic unit, run this command: % git commit -a (That C<-a> tells git to add every file you've changed to this commit. New files aren't automatically added to your commit when you use C If you want to add files or to commit some, but not all of your changes, have a look at the documentation for C.) Git will start up your favorite text editor, so that you can craft a commit message for your change. See L for more information about what makes a good commit message. Once you've finished writing your commit message and exited your editor, git will write your change to disk and tell you something like this: Created commit daf8e63: explain git status and stuff about remotes 1 files changed, 83 insertions(+), 3 deletions(-) If you re-run C, you should see something like this: % git status # On branch blead # Your branch is ahead of 'origin/blead' by 2 commits. # # Untracked files: # (use "git add ..." to include in what will be committed) # # deliberate.untracked nothing added to commit but untracked files present (use "git add" to track) When in doubt, before you do anything else, check your status and read it carefully, many questions are answered directly by the git status output. =head2 Using git to send patch emails Please read L first in order to figure out where your patches should be sent. 
In your ~/git/perl repository, set the destination email to perl's bug tracker: $ git config sendemail.to perlbug@perl.org Or maybe perl5-porters: $ git config sendemail.to perl5-porters@perl.org Then you can use git directly to send your patch emails: $ git send-email 0001-Rename-Leon-Brocard-to-Orange-Brocard.patch You may need to set some configuration variables for your particular email service provider. For example, to set your global git config to send email via a gmail account: $ git config --global sendemail.smtpserver smtp.gmail.com $ git config --global sendemail.smtpssl 1 $ git config --global sendemail.smtpuser YOURUSERNAME@gmail.com With this configuration, you will be prompted for your gmail password when you run 'git send-email'. You can also configure C with your password if you don't care about having your password in the .gitconfig file. =head2 A note on derived files Be aware that many files in the distribution are derivative--avoid patching them, because git won't see the changes to them, and the build process will overwrite them. Patch the originals instead. Most utilities (like perldoc) are in this category, i.e. patch F rather than F. Similarly, don't create patches for files under $src_root/ext from their copies found in $install_root/lib. If you are unsure about the proper location of a file that may have gotten copied while building the source distribution, consult the C. =head2 Cleaning a working directory The command C can with varying arguments be used as a replacement for C. To reset your working directory to a pristine condition you can do: % git clean -dxf However, be aware this will delete ALL untracked content. You can use % git clean -Xf to remove all ignored untracked files, such as build and test byproduct, but leave any manually created files alone. If you only want to cancel some uncommitted edits, you can use C and give it a list of files to be reverted, or C to revert them all. 
If you want to cancel one or several commits, you can use C. =head2 Bisecting C provides a built-in way to determine which commit should be blamed for introducing a given bug. C performs a binary search of history to locate the first failing commit. It is fast, powerful and flexible, but requires some setup and to automate the process an auxiliary shell script is needed. The core provides a wrapper program, F, which attempts to simplify as much as possible, making bisecting as simple as running a Perl one-liner. For example, if you want to know when this became an error: perl -e 'my $a := 2' you simply run this: .../Porting/bisect.pl -e 'my $a := 2;' Using C, with one command (and no other files) it's easy to find out =over 4 =item * Which commit caused this example code to break? =item * Which commit caused this example code to start working? =item * Which commit added the first file to match this regex? =item * Which commit removed the last file to match this regex? =back usually without needing to know which versions of perl to use as start and end revisions, as F automatically searches to find the earliest stable version for which the test case passes. Run C for the full documentation, including how to set the C and build time options. If you require more flexibility than F has to offer, you'll need to run C yourself. It's most useful to use C to automate the building and testing of perl revisions. For this you'll need a shell script for C to call to test a particular revision. An example script is F, which you should copy B of the repository, as the bisect process will reset the state to a clean checkout as it runs. The instructions below assume that you copied it as F<~/run> and then edited it as appropriate. 
You first enter in bisect mode with: % git bisect start For example, if the bug is present on C but wasn't in 5.10.0, C will learn about this when you enter: % git bisect bad % git bisect good perl-5.10.0 Bisecting: 853 revisions left to test after this This results in checking out the median commit between C and C. You can then run the bisecting process with: % git bisect run ~/run When the first bad commit is isolated, C will tell you so: ca4cfd28534303b82a216cfe83a1c80cbc3b9dc5 is first bad commit commit ca4cfd28534303b82a216cfe83a1c80cbc3b9dc5 Author: Dave Mitchell Date: Sat Feb 9 14:56:23 2008 +0000 [perl #49472] Attributes + Unknown Error ... bisect run success You can peek into the bisecting process with C and C. C will get you out of bisect mode. Please note that the first C state must be an ancestor of the first C state. If you want to search for the commit that I some bug, you have to negate your test case (i.e. exit with C<1> if OK and C<0> if not) and still mark the lower bound as C and the upper as C. The "first bad commit" has then to be understood as the "first commit where the bug is solved". C has much more information on how you can tweak your binary searches. =head1 Topic branches and rewriting history Individual committers should create topic branches under B/B. Other committers should check with a topic branch's creator before making any change to it. The simplest way to create a remote topic branch that works on all versions of git is to push the current head as a new branch on the remote, then check it out locally: $ branch="$yourname/$some_descriptive_name" $ git push origin HEAD:$branch $ git checkout -b $branch origin/$branch Users of git 1.7 or newer can do it in a more obvious manner: $ branch="$yourname/$some_descriptive_name" $ git checkout -b $branch $ git push origin -u $branch If you are not the creator of B/B, you might sometimes find that the original author has edited the branch's history. There are lots of good reasons for this. 
Sometimes, an author might simply be rebasing the branch onto a newer source point. Sometimes, an author might have found an error in an early commit which they wanted to fix before merging the branch to blead. Currently the master repository is configured to forbid non-fast-forward merges. This means that the branches within can not be rebased and pushed as a single step. The only way you will ever be allowed to rebase or modify the history of a pushed branch is to delete it and push it as a new branch under the same name. Please think carefully about doing this. It may be better to sequentially rename your branches so that it is easier for others working with you to cherry-pick their local changes onto the new version. (XXX: needs explanation). If you want to rebase a personal topic branch, you will have to delete your existing topic branch and push as a new version of it. You can do this via the following formula (see the explanation about C's in the git push documentation for details) after you have rebased your branch: # first rebase $ git checkout $user/$topic $ git fetch $ git rebase origin/blead # then "delete-and-push" $ git push origin :$user/$topic $ git push origin $user/$topic B it is forbidden at the repository level to delete any of the "primary" branches. That is any branch matching C. Any attempt to do so will result in git producing an error like this: $ git push origin :blead *** It is forbidden to delete blead/maint branches in this repository error: hooks/update exited with error code 1 error: hook declined to update refs/heads/blead To ssh://perl5.git.perl.org/perl ! [remote rejected] blead (hook declined) error: failed to push some refs to 'ssh://perl5.git.perl.org/perl' As a matter of policy we do B edit the history of the blead and maint-* branches. If a typo (or worse) sneaks into a commit to blead or maint-*, we'll fix it in another commit. 
The only types of updates allowed on these branches are "fast-forward's", where all history is preserved. Annotated tags in the canonical perl.git repository will never be deleted or modified. Think long and hard about whether you want to push a local tag to perl.git before doing so. (Pushing unannotated tags is not allowed.) =head2 Grafts The perl history contains one mistake which was not caught in the conversion: a merge was recorded in the history between blead and maint-5.10 where no merge actually occurred. Due to the nature of git, this is now impossible to fix in the public repository. You can remove this mis-merge locally by adding the following line to your C<.git/info/grafts> file: 296f12bbbbaa06de9be9d09d3dcf8f4528898a49 434946e0cb7a32589ed92d18008aaa1d88515930 It is particularly important to have this graft line if any bisecting is done in the area of the "merge" in question. =head1 WRITE ACCESS TO THE GIT REPOSITORY Once you have write access, you will need to modify the URL for the origin remote to enable pushing. Edit F<.git/config> with the git-config(1) command: % git config remote.origin.url ssh://perl5.git.perl.org/perl.git You can also set up your user name and e-mail address. 
Most people do this once globally in their F<~/.gitconfig> by doing something like: % git config --global user.name "Ævar Arnfjörð Bjarmason" % git config --global user.email avarab@gmail.com However, if you'd like to override that just for perl, then execute something like the following in your perl repository: % git config user.email avar@cpan.org It is also possible to keep C<origin> as a git remote, and add a new remote for ssh access: % git remote add camel perl5.git.perl.org:/perl.git This allows you to update your local repository by pulling from C<origin>, which is faster and doesn't require you to authenticate, and to push your changes back with the C<camel> remote: % git fetch camel % git push camel The C<fetch> command just updates the C<camel> refs, as the objects themselves should have been fetched when pulling from C<origin>. =head1 Accepting a patch If you have received a patch file generated using the above section, you should try out the patch. First we need to create a temporary new branch for these changes and switch into it: % git checkout -b experimental Patches that were formatted by C<git format-patch> are applied with C<git am>: % git am 0001-Rename-Leon-Brocard-to-Orange-Brocard.patch Applying Rename Leon Brocard to Orange Brocard If just a raw diff is provided, it is also possible to use this two-step process: % git apply bugfix.diff % git commit -a -m "Some fixing" --author="That Guy " Now we can inspect the change: % git show HEAD commit b1b3dab48344cff6de4087efca3dbd63548ab5e2 Author: Leon Brocard Date: Fri Dec 19 17:02:59 2008 +0000 Rename Leon Brocard to Orange Brocard diff --git a/AUTHORS b/AUTHORS index 293dd70..722c93e 100644 --- a/AUTHORS +++ b/AUTHORS @@ -541,7 +541,7 @@ Lars Hecking Laszlo Molnar Leif Huhn Len Johnson -Leon Brocard +Orange Brocard Les Peters Lesley Binks Lincoln D.
Stein If you are a committer to Perl and you think the patch is good, you can then merge it into blead then push it out to the main repository: % git checkout blead % git merge experimental % git push If you want to delete your temporary branch, you may do so with: % git checkout blead % git branch -d experimental error: The branch 'experimental' is not an ancestor of your current HEAD. If you are sure you want to delete it, run 'git branch -D experimental'. % git branch -D experimental Deleted branch experimental. =head2 Committing to blead The 'blead' branch will become the next production release of Perl. Before pushing I local change to blead, it's incredibly important that you do a few things, lest other committers come after you with pitchforks and torches: =over =item * Make sure you have a good commit message. See L for details. =item * Run the test suite. You might not think that one typo fix would break a test file. You'd be wrong. Here's an example of where not running the suite caused problems. A patch was submitted that added a couple of tests to an existing .t. It couldn't possibly affect anything else, so no need to test beyond the single affected .t, right? But, the submitter's email address had changed since the last of their submissions, and this caused other tests to fail. Running the test target given in the next item would have caught this problem. =item * If you don't run the full test suite, at least C. This will run basic sanity checks. To see which sanity checks, have a look in F. =item * If you make any changes that affect miniperl or core routines that have different code paths for miniperl, be sure to run C. This will catch problems that even the full test suite will not catch because it runs a subset of tests under miniperl rather than perl. =back =head3 On merging and rebasing Simple, one-off commits pushed to the 'blead' branch should be simple commits that apply cleanly. 
In other words, you should make sure your work is committed against the current position of blead, so that you can push back to the master repository without merging. Sometimes, blead will move while you're building or testing your changes. When this happens, your push will be rejected with a message like this: To ssh://perl5.git.perl.org/perl.git ! [rejected] blead -> blead (non-fast-forward) error: failed to push some refs to 'ssh://perl5.git.perl.org/perl.git' To prevent you from losing history, non-fast-forward updates were rejected Merge the remote changes (e.g. 'git pull') before pushing again. See the 'Note about fast-forwards' section of 'git push --help' for details. When this happens, you can just I your work against the new position of blead, like this (assuming your remote for the master repository is "p5p"): $ git fetch p5p $ git rebase p5p/blead You will see your commits being re-applied, and you will then be able to push safely. More information about rebasing can be found in the documentation for the git-rebase(1) command. For larger sets of commits that only make sense together, or that would benefit from a summary of the set's purpose, you should use a merge commit. You should perform your work on a L, which you should regularly rebase against blead to ensure that your code is not broken by blead moving. When you have finished your work, please perform a final rebase and test. Linear history is something that gets lost with every commit on blead, but a final rebase makes the history linear again, making it easier for future maintainers to see what has happened. Rebase as follows (assuming your work was on the branch C<< committer/somework >>): $ git checkout committer/somework $ git rebase blead Then you can merge it into master like this: $ git checkout blead $ git merge --no-ff --no-commit committer/somework $ git commit -a The switches above deserve explanation. 
C<--no-ff> indicates that even if all your work can be applied linearly against blead, a merge commit should still be prepared. This ensures that all your work will be shown as a side branch, with all its commits merged into the mainstream blead by the merge commit. C<--no-commit> means that the merge commit will be I but not I. The commit is then actually performed when you run the next command, which will bring up your editor to describe the commit. Without C<--no-commit>, the commit would be made with nearly no useful message, which would greatly diminish the value of the merge commit as a placeholder for the work's description. When describing the merge commit, explain the purpose of the branch, and keep in mind that this description will probably be used by the eventual release engineer when reviewing the next perldelta document. =head2 Committing to maintenance versions Maintenance versions should only be altered to add critical bug fixes, see L. To commit to a maintenance version of perl, you need to create a local tracking branch: % git checkout --track -b maint-5.005 origin/maint-5.005 This creates a local branch named C, which tracks the remote branch C. Then you can pull, commit, merge and push as before. You can also cherry-pick commits from blead and another branch, by using the C command. It is recommended to use the B<-x> option to C in order to record the SHA1 of the original commit in the new commit message. Before pushing any change to a maint version, make sure you've satisfied the steps in L above. =head2 Merging from a branch via GitHub While we don't encourage the submission of patches via GitHub, that will still happen. Here is a guide to merging patches from a GitHub repository. 

  % git remote add avar git://github.com/avar/perl.git
  % git fetch avar

Now you can see the differences between the branch and blead:

  % git diff avar/orange

And you can see the commits:

  % git log avar/orange

If you approve of a specific commit, you can cherry pick it:

  % git cherry-pick 0c24b290ae02b2ab3304f51d5e11e85eb3659eae

Or you could just merge the whole branch if you like it all:

  % git merge avar/orange

And then push back to the repository:

  % git push

=head2 A note on camel and dromedary

The committers have SSH access to the two servers that serve C. One is C itself (I), which is the 'master' repository. The second one is C (I), which can be used for general testing and development. Dromedary syncs the git tree from camel every few minutes, you should not push there. Both machines also have a full CPAN mirror in /srv/CPAN, please use this. To share files with the general public, dromedary serves your ~/public_html/ as C

These hosts have fairly strict firewalls to the outside. Outgoing, only rsync, ssh and git are allowed. For http and ftp, you can use http://webproxy:3128 as proxy. Incoming, the firewall tries to detect attacks and blocks IP addresses with suspicious activity. This sometimes (but very rarely) has false positives and you might get blocked. The quickest way to get unblocked is to notify the admins.

These two boxes are owned, hosted, and operated by booking.com. You can reach the sysadmins in #p5p on irc.perl.org or via mail to C.

=head1 NAME

perlrequick - Perl regular expressions quick start

=head1 DESCRIPTION

This page covers the very basics of understanding, creating and using regular expressions ('regexes') in Perl.

=head1 The Guide

=head2 Simple word matching

The simplest regex is simply a word, or more generally, a string of characters.
A regex consisting of a word matches any string that contains that word:

    "Hello World" =~ /World/;  # matches

In this statement, C<World> is a regex and the C<//> enclosing C</World/> tells Perl to search a string for a match. The operator C<=~> associates the string with the regex match and produces a true value if the regex matched, or false if the regex did not match. In our case, C<World> matches the second word in C<"Hello World">, so the expression is true. This idea has several variations.

Expressions like this are useful in conditionals:

    print "It matches\n" if "Hello World" =~ /World/;

The sense of the match can be reversed by using the C<!~> operator:

    print "It doesn't match\n" if "Hello World" !~ /World/;

The literal string in the regex can be replaced by a variable:

    $greeting = "World";
    print "It matches\n" if "Hello World" =~ /$greeting/;

If you're matching against C<$_>, the C<$_ =~> part can be omitted:

    $_ = "Hello World";
    print "It matches\n" if /World/;

Finally, the C<//> default delimiters for a match can be changed to arbitrary delimiters by putting an C<'m'> out front:

    "Hello World" =~ m!World!;   # matches, delimited by '!'
    "Hello World" =~ m{World};   # matches, note the matching '{}'
    "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
                                 # '/' becomes an ordinary char

Regexes must match a part of the string I<exactly> in order for the statement to be true:

    "Hello World" =~ /world/;  # doesn't match, case sensitive
    "Hello World" =~ /o W/;    # matches, ' ' is an ordinary char
    "Hello World" =~ /World /; # doesn't match, no ' ' at end

Perl will always match at the earliest possible point in the string:

    "Hello World" =~ /o/;       # matches 'o' in 'Hello'
    "That hat is red" =~ /hat/; # matches 'hat' in 'That'

Not all characters can be used 'as is' in a match. Some characters, called B<metacharacters>, are reserved for use in regex notation.
The metacharacters are

    {}[]()^$.|*+?\

A metacharacter can be matched by putting a backslash before it:

    "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
    "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
    'C:\WIN32' =~ /C:\\WIN/;                # matches
    "/usr/bin/perl" =~ /\/usr\/bin\/perl/;  # matches

In the last regex, the forward slash C<'/'> is also backslashed, because it is used to delimit the regex.

Non-printable ASCII characters are represented by B<escape sequences>. Common examples are C<\t> for a tab, C<\n> for a newline, and C<\r> for a carriage return. Arbitrary bytes are represented by octal escape sequences, e.g., C<\033>, or hexadecimal escape sequences, e.g., C<\x1B>:

    "1000\t2000" =~ m(0\t2)  # matches
    "cat" =~ /\143\x61\x74/  # matches in ASCII, but a weird way to spell cat

Regexes are treated mostly as double-quoted strings, so variable substitution works:

    $foo = 'house';
    'cathouse' =~ /cat$foo/;   # matches
    'housecat' =~ /${foo}cat/; # matches

With all of the regexes above, if the regex matched anywhere in the string, it was considered a match. To specify I<where> it should match, we would use the B<anchor> metacharacters C<^> and C<$>. The anchor C<^> means match at the beginning of the string and the anchor C<$> means match at the end of the string, or before a newline at the end of the string. Some examples:

    "housekeeper" =~ /keeper/;         # matches
    "housekeeper" =~ /^keeper/;        # doesn't match
    "housekeeper" =~ /keeper$/;        # matches
    "housekeeper\n" =~ /keeper$/;      # matches
    "housekeeper" =~ /^housekeeper$/;  # matches

=head2 Using character classes

A B<character class> allows a set of possible characters, rather than just a single character, to match at a particular point in a regex. Character classes are denoted by brackets C<[...]>, with the set of characters to be possibly matched inside.
Here are some examples:

    /cat/;            # matches 'cat'
    /[bcr]at/;        # matches 'bat', 'cat', or 'rat'
    "abc" =~ /[cab]/; # matches 'a'

In the last statement, even though C<'c'> is the first character in the class, the earliest point at which the regex can match is C<'a'>.

    /[yY][eE][sS]/; # match 'yes' in a case-insensitive way
                    # 'yes', 'Yes', 'YES', etc.
    /yes/i;         # also match 'yes' in a case-insensitive way

The last example shows a match with an C<'i'> B<modifier>, which makes the match case-insensitive.

Character classes also have ordinary and special characters, but the sets of ordinary and special characters inside a character class are different than those outside a character class. The special characters for a character class are C<-]\^$> and are matched using an escape:

    /[\]c]def/; # matches ']def' or 'cdef'
    $x = 'bcr';
    /[$x]at/;   # matches 'bat', 'cat', or 'rat'
    /[\$x]at/;  # matches '$at' or 'xat'
    /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'

The special character C<'-'> acts as a range operator within character classes, so that the unwieldy C<[0123456789]> and C<[abc...xyz]> become the svelte C<[0-9]> and C<[a-z]>:

    /item[0-9]/;    # matches 'item0' or ... or 'item9'
    /[0-9a-fA-F]/;  # matches a hexadecimal digit

If C<'-'> is the first or last character in a character class, it is treated as an ordinary character.

The special character C<^> in the first position of a character class denotes a B<negated character class>, which matches any character but those in the brackets. Both C<[...]> and C<[^...]> must match a character, or the match fails. Then

    /[^a]at/;  # doesn't match 'aat' or 'at', but matches
               # all other 'bat', 'cat', '0at', '%at', etc.
    /[^0-9]/;  # matches a non-numeric character
    /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary

Perl has several abbreviations for common character classes. (These definitions are those that Perl uses in ASCII-safe mode with the C</a> modifier. Otherwise they could match many more non-ASCII Unicode characters as well. See L<perlrecharclass/Backslash sequences> for details.)

=over 4

=item *

\d is a digit and represents

    [0-9]

=item *

\s is a whitespace character and represents

    [\ \t\r\n\f]

=item *

\w is a word character (alphanumeric or _) and represents

    [0-9a-zA-Z_]

=item *

\D is a negated \d; it represents any character but a digit

    [^0-9]

=item *

\S is a negated \s; it represents any non-whitespace character

    [^\s]

=item *

\W is a negated \w; it represents any non-word character

    [^\w]

=item *

The period '.' matches any character but "\n"

=back

The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside of character classes. Here are some in use:

    /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
    /[\d\s]/;         # matches any digit or whitespace character
    /\w\W\w/;         # matches a word char, followed by a
                      # non-word char, followed by a word char
    /..rt/;           # matches any two chars, followed by 'rt'
    /end\./;          # matches 'end.'
    /end[.]/;         # same thing, matches 'end.'

The B<word anchor> C<\b> matches a boundary between a word character and a non-word character C<\w\W> or C<\W\w>:

    $x = "Housecat catenates house and cat";
    $x =~ /\bcat/;    # matches cat in 'catenates'
    $x =~ /cat\b/;    # matches cat in 'Housecat'
    $x =~ /\bcat\b/;  # matches 'cat' at end of string

In the last example, the end of the string is considered a word boundary.

=head2 Matching this or that

We can match different character strings with the B<alternation> metacharacter C<'|'>. To match C<dog> or C<cat>, we form the regex C<dog|cat>. As before, Perl will try to match the regex at the earliest possible point in the string. At each character position, Perl will first try to match the first alternative, C<dog>. If C<dog> doesn't match, Perl will then try the next alternative, C<cat>. If C<cat> doesn't match either, then the match fails and Perl moves to the next position in the string. Some examples:

    "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
    "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"

Even though C<dog> is the first alternative in the second regex, C<cat> is able to match earlier in the string.
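
The leftmost-match behavior described above can be checked with a short sketch. The helper C<first_match> is a name made up for this illustration, not a Perl builtin; it wraps the regex in a capturing group and returns the text that matched, or C<undef> if nothing did.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Perl): return the text matched by
# $regex in $string, or undef when the regex does not match.
sub first_match {
    my ($string, $regex) = @_;
    my ($matched) = $string =~ /($regex)/;
    return $matched;
}

# The order of the alternatives does not matter here, because 'cat'
# occurs earlier in the string than 'dog'.
print first_match("cats and dogs", qr/cat|dog|bird/), "\n";  # cat
print first_match("cats and dogs", qr/dog|cat|bird/), "\n";  # cat
```

With the target string C<"dogs and cats"> both calls would return C<dog> instead, since the earliest position at which any alternative can match takes precedence over the order in which the alternatives are written.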
"cats" =~ /c|ca|cat|cats/; # matches "c" "cats" =~ /cats|cat|ca|c/; # matches "cats" At a given character position, the first alternative that allows the regex match to succeed will be the one that matches. Here, all the alternatives match at the first string position, so the first matches. =head2 Grouping things and hierarchical matching The B metacharacters C<()> allow a part of a regex to be treated as a single unit. Parts of a regex are grouped by enclosing them in parentheses. The regex C means match C followed by either C or C. Some more examples are /(a|b)b/; # matches 'ab' or 'bb' /(^a|b)c/; # matches 'ac' at start of string or 'bc' anywhere /house(cat|)/; # matches either 'housecat' or 'house' /house(cat(s|)|)/; # matches either 'housecats' or 'housecat' or # 'house'. Note groups can be nested. "20" =~ /(19|20|)\d\d/; # matches the null alternative '()\d\d', # because '20\d\d' can't match =head2 Extracting matches The grouping metacharacters C<()> also allow the extraction of the parts of a string that matched. For each grouping, the part that matched inside goes into the special variables C<$1>, C<$2>, etc. They can be used just as ordinary variables: # extract hours, minutes, seconds $time =~ /(\d\d):(\d\d):(\d\d)/; # match hh:mm:ss format $hours = $1; $minutes = $2; $seconds = $3; In list context, a match C with groupings will return the list of matched values C<($1,$2,...)>. So we could rewrite it as ($hours, $minutes, $second) = ($time =~ /(\d\d):(\d\d):(\d\d)/); If the groupings in a regex are nested, C<$1> gets the group with the leftmost opening parenthesis, C<$2> the next opening parenthesis, etc. For example, here is a complex regex and the matching variables indicated below it: /(ab(cd|ef)((gi)|j))/; 1 2 34 Associated with the matching variables C<$1>, C<$2>, ... are the B C<\g1>, C<\g2>, ... Backreferences are matching variables that can be used I a regex: /(\w\w\w)\s\g1/; # find sequences like 'the the' in string C<$1>, C<$2>, ... 
should only be used outside of a regex, and C<\g1>, C<\g2>, ... only inside a regex.

=head2 Matching repetitions

The B<quantifier> metacharacters C<?>, C<*>, C<+>, and C<{}> allow us to determine the number of repeats of a portion of a regex we consider to be a match. Quantifiers are put immediately after the character, character class, or grouping that we want to specify. They have the following meanings:

=over 4

=item *

C<a?> = match 'a' 1 or 0 times

=item *

C<a*> = match 'a' 0 or more times, i.e., any number of times

=item *

C<a+> = match 'a' 1 or more times, i.e., at least once

=item *

C<a{n,m}> = match at least C<n> times, but not more than C<m> times.

=item *

C<a{n,}> = match C<n> or more times

=item *

C<a{n}> = match exactly C<n> times

=back

Here are some examples:

    /[a-z]+\s+\d*/;  # match a lowercase word, at least some space, and
                     # any number of digits
    /(\w+)\s+\g1/;   # match doubled words of arbitrary length
    $year =~ /^\d{2,4}$/;  # make sure year is at least 2 but not more
                           # than 4 digits
    $year =~ /^\d{4}$|^\d{2}$/; # better match; throw out 3 digit dates

These quantifiers will try to match as much of the string as possible, while still allowing the regex to match. So we have

    $x = 'the cat in the hat';
    $x =~ /^(.*)(at)(.*)$/; # matches,
                            # $1 = 'the cat in the h'
                            # $2 = 'at'
                            # $3 = ''   (0 matches)

The first quantifier C<.*> grabs as much of the string as possible while still having the regex match. The second quantifier C<.*> has no string left to it, so it matches 0 times.

=head2 More matching

There are a few more things you might want to know about matching operators. The global modifier C<//g> allows the matching operator to match within a string as many times as possible. In scalar context, successive matches against a string will have C<//g> jump from match to match, keeping track of position in the string as it goes along. You can get or set the position with the C<pos()> function.
For example,

    $x = "cat dog house"; # 3 words
    while ($x =~ /(\w+)/g) {
        print "Word is $1, ends at position ", pos $x, "\n";
    }

prints

    Word is cat, ends at position 3
    Word is dog, ends at position 7
    Word is house, ends at position 13

A failed match or changing the target string resets the position. If you don't want the position reset after failure to match, add the C<//c> modifier, as in C</regex/gc>.

In list context, C<//g> returns a list of matched groupings, or if there are no groupings, a list of matches to the whole regex. So

    @words = ($x =~ /(\w+)/g);  # matches,
                                # $words[0] = 'cat'
                                # $words[1] = 'dog'
                                # $words[2] = 'house'

=head2 Search and replace

Search and replace is performed using C<s/regex/replacement/modifiers>. The C<replacement> is a Perl double-quoted string that replaces in the string whatever is matched with the C<regex>. The operator C<=~> is also used here to associate a string with C<s///>. If matching against C<$_>, the C<$_ =~> can be dropped. If there is a match, C<s///> returns the number of substitutions made; otherwise it returns false. Here are a few examples:

    $x = "Time to feed the cat!";
    $x =~ s/cat/hacker/;   # $x contains "Time to feed the hacker!"
    $y = "'quoted words'";
    $y =~ s/^'(.*)'$/$1/;  # strip single quotes,
                           # $y contains "quoted words"

With the C<s///> operator, the matched variables C<$1>, C<$2>, etc. are immediately available for use in the replacement expression. With the global modifier, C<s///g> will search and replace all occurrences of the regex in the string:

    $x = "I batted 4 for 4";
    $x =~ s/4/four/;   # $x contains "I batted four for 4"
    $x = "I batted 4 for 4";
    $x =~ s/4/four/g;  # $x contains "I batted four for four"

The non-destructive modifier C<s///r> causes the result of the substitution to be returned instead of modifying C<$_> (or whatever variable the substitute was bound to with C<=~>):

    $x = "I like dogs.";
    $y = $x =~ s/dogs/cats/r;
    print "$x $y\n"; # prints "I like dogs. I like cats."

    $x = "Cats are great.";
    print $x =~ s/Cats/Dogs/r =~ s/Dogs/Frogs/r =~ s/Frogs/Hedgehogs/r, "\n";
    # prints "Hedgehogs are great."
    @foo = map { s/[a-z]/X/r } qw(a b c 1 2 3);
    # @foo is now qw(X X X 1 2 3)

The evaluation modifier C<s///e> wraps an C<eval{...}> around the
replacement string and the evaluated result is substituted for the
matched substring.  Some examples:

    # reverse all the words in a string
    $x = "the cat in the hat";
    $x =~ s/(\w+)/reverse $1/ge;   # $x contains "eht tac ni eht tah"

    # convert percentage to decimal
    $x = "A 39% hit rate";
    $x =~ s!(\d+)%!$1/100!e;       # $x contains "A 0.39 hit rate"

The last example shows that C<s///> can use other delimiters, such as
C<s!!!> and C<s{}{}>, and even C<s{}//>.  If single quotes are used
C<s'''>, then the regex and replacement are treated as single-quoted
strings.

=head2 The split operator

C<split /regex/, string> splits C<string> into a list of substrings
and returns that list.  The regex determines the character sequence
that C<string> is split with respect to.  For example, to split a
string into words, use

    $x = "Calvin and Hobbes";
    @word = split /\s+/, $x;  # $word[0] = 'Calvin'
                              # $word[1] = 'and'
                              # $word[2] = 'Hobbes'

To extract a comma-delimited list of numbers, use

    $x = "1.618,2.718,   3.142";
    @const = split /,\s*/, $x;  # $const[0] = '1.618'
                                # $const[1] = '2.718'
                                # $const[2] = '3.142'

If the empty regex C<//> is used, the string is split into individual
characters.  If the regex has groupings, then the list produced contains
the matched substrings from the groupings as well:

    $x = "/usr/bin";
    @parts = split m!(/)!, $x;  # $parts[0] = ''
                                # $parts[1] = '/'
                                # $parts[2] = 'usr'
                                # $parts[3] = '/'
                                # $parts[4] = 'bin'

Since the first character of $x matched the regex, C<split> prepended
an empty initial element to the list.

=head1 BUGS

None.

=head1 SEE ALSO

This is just a quick start guide.  For a more in-depth tutorial on
regexes, see L<perlretut> and for the reference page, see L<perlre>.

=head1 AUTHOR AND COPYRIGHT

Copyright (c) 2000 Mark Kvale
All rights reserved.

This document may be distributed under the same terms as Perl itself.

=head2 Acknowledgments

The author would like to thank Mark-Jason Dominus, Tom Christiansen,
Ilya Zakharevich, Brad Hughes, and Mike Giroux for all their helpful
comments.

=cut

=head1 NAME

perlunifaq - Perl Unicode FAQ

=head1 Q and A

This is a list of questions and answers about Unicode in Perl, intended to be
read after L<perlunitut>.

=head2 perlunitut isn't really a Unicode tutorial, is it?

No, and this isn't really a Unicode FAQ.

Perl has an abstracted interface for all supported character encodings, so this
is actually a generic C<Encode> tutorial and C<Encode> FAQ. But many people
think that Unicode is special and magical, and I didn't want to disappoint
them, so I decided to call the document a Unicode tutorial.

=head2 What character encodings does Perl support?

To find out which character encodings your Perl supports, run:

    perl -MEncode -le "print for Encode->encodings(':all')"

=head2 Which version of perl should I use?

Well, if you can, upgrade to the most recent, but certainly C<5.8.1> or newer.
The tutorial and FAQ assume the latest release.

You should also check your modules, and upgrade them if necessary. For example,
HTML::Entities requires version >= 1.32 to function correctly, even though the
changelog is silent about this.

=head2 What about binary data, like images?

Well, apart from a bare C<binmode $fh>, you shouldn't treat them specially.
(The binmode is needed because otherwise Perl may convert line endings on Win32
systems.)

Be careful, though, to never combine text strings with binary strings. If you
need text in a binary stream, encode your text strings first using the
appropriate encoding, then join them with binary strings. See also: "What if I
don't encode?".

=head2 When should I decode or encode?

Whenever you're communicating text with anything that is external to your perl
process, like a database, a text file, a socket, or another program. Even if
the thing you're communicating with is also written in Perl.
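
The rule above can be sketched as follows. This is only an illustration: it
assumes the outside world speaks UTF-8, and it uses a hard-coded byte string
(hypothetical sample data) to stand in for whatever a file, socket, or
database would deliver.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from outside the process: the word
# "cafe" with an acute accent on the e, encoded as UTF-8.
# (Hypothetical sample data.)
my $bytes_in = "caf\xC3\xA9";

# Decode at the boundary: from here on we hold a text string.
my $text       = decode('UTF-8', $bytes_in);
my $char_count = length $text;        # counts characters

# Work on the text as text.
my $shout = uc $text;

# Encode at the boundary, just before the data leaves the process.
my $bytes_out  = encode('UTF-8', $shout);
my $byte_count = length $bytes_out;   # counts bytes
```

Between the two boundaries the program only ever sees characters, so
C<length> and C<uc> behave the way a reader of the text would expect.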

=head2 What if I don't decode?

Whenever your encoded, binary string is used together with a text string, Perl
will assume that your binary string was encoded with ISO-8859-1, also known as
latin-1. If it wasn't latin-1, then your data is unpleasantly converted. For
example, if it was UTF-8, the individual bytes of multibyte characters are seen
as separate characters, and then again converted to UTF-8. Such double encoding
can be compared to double HTML encoding (C<&amp;gt;>), or double URI encoding
(C<%253E>).

This silent implicit decoding is known as "upgrading". That may sound
positive, but it's best to avoid it.

=head2 What if I don't encode?

Your text string will be sent using the bytes in Perl's internal format. In
some cases, Perl will warn you that you're doing something wrong, with a
friendly warning:

    Wide character in print at example.pl line 2.

Because the internal format is often UTF-8, these bugs are hard to spot,
because UTF-8 is usually the encoding you wanted! But don't be lazy, and don't
use the fact that Perl's internal format is UTF-8 to your advantage. Encode
explicitly to avoid weird bugs, and to show to maintenance programmers that you
thought this through.

=head2 Is there a way to automatically decode or encode?

If all data that comes from a certain handle is encoded in exactly the same
way, you can tell the PerlIO system to automatically decode everything, with
the C<:encoding> layer. If you do this, you can't accidentally forget to decode
or encode anymore, on things that use the layered handle.

You can provide this layer when C<open>ing the file:

    open my $fh, '>:encoding(UTF-8)', $filename;  # auto encoding on write
    open my $fh, '<:encoding(UTF-8)', $filename;  # auto decoding on read

Or if you already have an open filehandle:

    binmode $fh, ':encoding(UTF-8)';

Some database drivers for DBI can also automatically encode and decode, but
that is sometimes limited to the UTF-8 encoding.

=head2 What if I don't know which encoding was used?


Do whatever you can to find out, and if you have to: guess. (Don't forget to
document your guess with a comment.)

You could open the document in a web browser, and change the character set or
character encoding until you can visually confirm that all characters look the
way they should.

There is no way to reliably detect the encoding automatically, so if people
keep sending you data without charset indication, you may have to educate them.

=head2 Can I use Unicode in my Perl sources?

Yes, you can! If your sources are UTF-8 encoded, you can indicate that with the
C<use utf8> pragma.

    use utf8;

This doesn't do anything to your input, or to your output. It only influences
the way your sources are read. You can use Unicode in string literals, in
identifiers (but they still have to be "word characters" according to C<\w>),
and even in custom delimiters.

=head2 Data::Dumper doesn't restore the UTF8 flag; is it broken?

No, Data::Dumper's Unicode abilities are as they should be. There have been
some complaints that it should restore the UTF8 flag when the data is read
again with C<eval>. However, you should really not look at the flag, and
nothing indicates that Data::Dumper should break this rule.

Here's what happens: when Perl reads in a string literal, it sticks to 8 bit
encoding as long as it can. (But perhaps originally it was internally encoded
as UTF-8, when you dumped it.) When it has to give that up because other
characters are added to the text string, it silently upgrades the string to
UTF-8.

If you properly encode your strings for output, none of this is of your
concern, and you can just C<eval> dumped data as always.

=head2 Why do regex character classes sometimes match only in the ASCII range?

=head2 Why do some characters not uppercase or lowercase correctly?

Starting in Perl 5.14 (and partially in Perl 5.12), just put a
C<use feature 'unicode_strings'> near the beginning of your program.
Within its lexical scope you shouldn't have this problem.
It is also automatically enabled under C<use feature ':5.12'> or using
C<-E> on the command line for Perl 5.12 or higher.

The rationale for requiring this is to not break older programs that
rely on the way things worked before Unicode came along.  Those older
programs knew only about the ASCII character set, and so may not work
properly for additional characters.  When a string is encoded in UTF-8,
Perl assumes that the program is prepared to deal with Unicode, but when
the string isn't, Perl assumes that only ASCII (unless it is an EBCDIC
platform) is wanted, and so characters outside the ASCII range are not
given the meaning they have in Unicode.
C<use feature 'unicode_strings'> tells Perl to treat all characters as
Unicode, whether the string is encoded in UTF-8 or not, thus avoiding
the problem.

However, on earlier Perls, or if you pass strings to subroutines outside
the feature's scope, you can force Unicode semantics by changing the
encoding to UTF-8 by doing C<utf8::upgrade($string)>.  This can be used
safely on any string, as it checks and does not change strings that have
already been upgraded.

For a more detailed discussion, see L<Unicode::Semantics> on CPAN.

=head2 How can I determine if a string is a text string or a binary string?

You can't. Some use the UTF8 flag for this, but that's misuse, and makes well
behaved modules like Data::Dumper look bad. The flag is useless for this
purpose, because it's off when an 8 bit encoding (by default ISO-8859-1) is
used to store the string.

This is something you, the programmer, have to keep track of; sorry. You could
consider adopting a kind of "Hungarian notation" to help with this.

=head2 How do I convert from encoding FOO to encoding BAR?


By first converting the FOO-encoded byte string to a text string, and then the
text string to a BAR-encoded byte string:

    my $text_string = decode('FOO', $foo_string);
    my $bar_string  = encode('BAR', $text_string);

or by skipping the text string part, and going directly from one binary
encoding to the other:

    use Encode qw(from_to);
    from_to($string, 'FOO', 'BAR');  # changes contents of $string

or by letting automatic decoding and encoding do all the work:

    open my $foofh, '<:encoding(FOO)', 'example.foo.txt';
    open my $barfh, '>:encoding(BAR)', 'example.bar.txt';
    print { $barfh } $_ while <$foofh>;

=head2 What are C<decode_utf8> and C<encode_utf8>?

These are alternate syntaxes for C<decode('utf8', ...)> and C<encode('utf8',
...)>.

=head2 What is a "wide character"?

Depending on the context, this term is used for characters with an ordinal
value greater than 127, for characters with an ordinal value greater than 255,
or for any character occupying more than one byte.

The Perl warning "Wide character in ..." is caused by a character with an
ordinal value greater than 255. With no specified encoding layer, Perl tries to
fit things in ISO-8859-1 for backward compatibility reasons. When it can't, it
emits this warning (if warnings are enabled), and outputs UTF-8 encoded data
instead.

To avoid this warning and to avoid having different output encodings in a single
stream, always specify an encoding explicitly, for example with a PerlIO layer:

    binmode STDOUT, ":encoding(UTF-8)";

=head1 INTERNALS

=head2 What is "the UTF8 flag"?

Please, unless you're hacking the internals, or debugging weirdness, don't
think about the UTF8 flag at all. That means that you very probably shouldn't
use C<is_utf8>, C<_utf8_on> or C<_utf8_off> at all.

The UTF8 flag, also called SvUTF8, is an internal flag that indicates that the
current internal representation is UTF-8. Without the flag, it is assumed to be
ISO-8859-1. Perl converts between these automatically.  (Actually Perl usually
assumes the representation is ASCII; see L</Why do regex character classes
sometimes match only in the ASCII range?> above.)

One of Perl's internal formats happens to be UTF-8. Unfortunately, Perl can't
keep a secret, so everyone knows about this. That is the source of much
confusion. It's better to pretend that the internal format is some unknown
encoding, and that you always have to encode and decode explicitly.

=head2 What about the C<use bytes> pragma?

Don't use it. It makes no sense to deal with bytes in a text string, and it
makes no sense to deal with characters in a byte string. Do the proper
conversions (by decoding/encoding), and things will work out well: you get
character counts for decoded data, and byte counts for encoded data.

C<use bytes> is usually a failed attempt to do something useful. Just forget
about it.

=head2 What about the C<use encoding> pragma?

Don't use it. Unfortunately, it assumes that the programmer's environment and
that of the user will use the same encoding. It will use the same encoding for
the source code and for STDIN and STDOUT. When a program is copied to another
machine, the source code does not change, but the STDIO environment might.

If you need non-ASCII characters in your source code, make it a UTF-8 encoded
file and C<use utf8>.

If you need to set the encoding for STDIN, STDOUT, and STDERR, for example
based on the user's locale, C<use open>.

=head2 What is the difference between C<:encoding> and C<:utf8>?

Because UTF-8 is one of Perl's internal formats, you can often just skip the
encoding or decoding step, and manipulate the UTF8 flag directly.

Instead of C<:encoding(UTF-8)>, you can simply use C<:utf8>, which skips the
encoding step if the data was already represented as UTF8 internally. This is
widely accepted as good behavior when you're writing, but it can be dangerous
when reading, because it causes internal inconsistency when you have invalid
byte sequences. Using C<:utf8> for input can sometimes result in security
breaches, so please use C<:encoding(UTF-8)> instead.

Instead of C<decode> and C<encode>, you could use C<_utf8_on> and C<_utf8_off>,
but this is considered bad style.
Especially C<_utf8_on> can be dangerous, for the
same reason that C<:utf8> can.

There are some shortcuts for oneliners;
see L<-C|perlrun/-C [numberE<sol>list]> in L<perlrun>.

=head2 What's the difference between C<UTF-8> and C<utf8>?

C<UTF-8> is the official standard. C<utf8> is Perl's way of being liberal in
what it accepts. If you have to communicate with things that aren't so liberal,
you may want to consider using C<UTF-8>. If you have to communicate with things
that are too liberal, you may have to use C<utf8>. The full explanation is in
L<Encode>.

C<UTF-8> is internally known as C<utf-8-strict>. The tutorial uses UTF-8
consistently, even where utf8 is actually used internally, because the
distinction can be hard to make, and is mostly irrelevant.

For example, utf8 can be used for code points that don't exist in Unicode, like
9999999, but if you encode that to UTF-8, you get a substitution character (by
default; see L<Encode/"Handling Malformed Data"> for more ways of dealing with
this.)

Okay, if you insist: the "internal format" is utf8, not UTF-8. (When it's not
some other encoding.)

=head2 I lost track; what encoding is the internal format really?

It's good that you lost track, because you shouldn't depend on the internal
format being any specific encoding. But since you asked: by default, the
internal format is either ISO-8859-1 (latin-1), or utf8, depending on the
history of the string. On EBCDIC platforms, this may even be different.

Perl knows how it stored the string internally, and will use that knowledge
when you C<encode>. In other words: don't try to find out what the internal
encoding for a certain string is, but instead just encode it into the encoding
that you want.
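
A small sketch of that last point, under the assumption that the desired
output encoding is UTF-8. The variable names are made up for the example,
and C<utf8::upgrade> is used here only to force the second copy into the
other internal format so that the two can be compared.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The same four characters, stored two different ways internally.
my $native   = "caf\x{E9}";   # typically kept in the 8 bit format
my $upgraded = "caf\x{E9}";
utf8::upgrade($upgraded);     # now stored as utf8 internally

# As text, the two strings are equal ...
my $same_text = $native eq $upgraded;

# ... and explicit encoding yields identical bytes for both,
# without the program ever inspecting the internal format.
my $bytes_native   = encode('UTF-8', $native);
my $bytes_upgraded = encode('UTF-8', $upgraded);
my $same_bytes     = $bytes_native eq $bytes_upgraded;
```

Both encoded results are the same five bytes, which is why the history of
a string does not matter as long as you encode explicitly.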

=head1 AUTHOR

Juerd Waalboer <#####@juerd.nl>

=head1 SEE ALSO

L<perlunicode>, L<perluniintro>, L<Encode>

=head1 NAME

perlgpl - the GNU General Public License, version 1

=head1 SYNOPSIS

 You can refer to this document in Pod via "L<perlgpl>"
 Or you can see this document by entering "perldoc perlgpl"

=head1 DESCRIPTION

Perl is free software; you can redistribute it and/or modify
it under the terms of either:

        a) the GNU General Public License as published by the Free
        Software Foundation; either version 1, or (at your option) any
        later version, or

        b) the "Artistic License" which comes with this Kit.

This is the B<"GNU General Public License, version 1">.
It's here so that modules, programs, etc., that want to declare
this as their distribution license can link to it.

For the Perl Artistic License, see L<perlartistic>.

=cut

# Because the following document's language disallows "changing"
# it, we haven't gone thru and prettied it up with =item's or
# anything.  It's good enough the way it is.

=head1 GNU GENERAL PUBLIC LICENSE

GNU GENERAL PUBLIC LICENSE
Version 1, February 1989

Copyright (C) 1989 Free Software Foundation, Inc.
51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA

Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.

Preamble

The license agreements of most software companies try to keep users
at the mercy of those companies.  By contrast, our General Public
License is intended to guarantee your freedom to share and change free
software--to make sure the software is free for all its users.  The
General Public License applies to the Free Software Foundation's
software and to any other program whose authors commit to using it.
You can use it for your programs, too.

When we speak of free software, we are referring to freedom, not
price.
Specifically, the General Public License is designed to make sure that you have the freedom to give away or sell copies of free software, that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs; and that you know you can do these things. To protect your rights, we need to make restrictions that forbid anyone to deny you these rights or to ask you to surrender the rights. These restrictions translate to certain responsibilities for you if you distribute copies of the software, or if you modify it. For example, if you distribute copies of a such a program, whether gratis or for a fee, you must give the recipients all the rights that you have. You must make sure that they, too, receive or can get the source code. And you must tell them their rights. We protect your rights with two steps: (1) copyright the software, and (2) offer you this license which gives you legal permission to copy, distribute and/or modify the software. Also, for each author's protection and ours, we want to make certain that everyone understands that there is no warranty for this free software. If the software is modified by someone else and passed on, we want its recipients to know that what they have is not the original, so that any problems introduced by others will not reflect on the original authors' reputations. The precise terms and conditions for copying, distribution and modification follow. GNU GENERAL PUBLIC LICENSE TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION 0. This License Agreement applies to any program or other work which contains a notice placed by the copyright holder saying it may be distributed under the terms of this General Public License. The "Program", below, refers to any such program or work, and a "work based on the Program" means either the Program or any work containing the Program or a portion of it, either verbatim or with modifications. 
Each licensee is addressed as "you". 1. You may copy and distribute verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this General Public License and to the absence of any warranty; and give any other recipients of the Program a copy of this General Public License along with the Program. You may charge a fee for the physical act of transferring a copy. 2. You may modify your copy or copies of the Program or any portion of it, and copy and distribute such modifications under the terms of Paragraph 1 above, provided that you also do the following: a) cause the modified files to carry prominent notices stating that you changed the files and the date of any change; and b) cause the whole of any work that you distribute or publish, that in whole or in part contains the Program or any part thereof, either with or without modifications, to be licensed at no charge to all third parties under the terms of this General Public License (except that you may choose to grant warranty protection to some or all third parties, at your option). c) If the modified program normally reads commands interactively when run, you must cause it, when started running for such interactive use in the simplest and most usual way, to print or display an announcement including an appropriate copyright notice and a notice that there is no warranty (or else, saying that you provide a warranty) and that users may redistribute the program under these conditions, and telling the user how to view a copy of this General Public License. d) You may charge a fee for the physical act of transferring a copy, and you may at your option offer warranty protection in exchange for a fee. 
Mere aggregation of another independent work with the Program (or its derivative) on a volume of a storage or distribution medium does not bring the other work under the scope of these terms. 3. You may copy and distribute the Program (or a portion or derivative of it, under Paragraph 2) in object code or executable form under the terms of Paragraphs 1 and 2 above provided that you also do one of the following: a) accompany it with the complete corresponding machine-readable source code, which must be distributed under the terms of Paragraphs 1 and 2 above; or, b) accompany it with a written offer, valid for at least three years, to give any third party free (except for a nominal charge for the cost of distribution) a complete machine-readable copy of the corresponding source code, to be distributed under the terms of Paragraphs 1 and 2 above; or, c) accompany it with the information you received as to where the corresponding source code may be obtained. (This alternative is allowed only for noncommercial distribution and only if you received the program in object code or executable form alone.) Source code for a work means the preferred form of the work for making modifications to it. For an executable file, complete source code means all the source code for all modules it contains; but, as a special exception, it need not include source code for modules which are standard libraries that accompany the operating system on which the executable file runs, or for standard header files or definitions files that accompany that operating system. 4. You may not copy, modify, sublicense, distribute or transfer the Program except as expressly provided under this General Public License. Any attempt otherwise to copy, modify, sublicense, distribute or transfer the Program is void, and will automatically terminate your rights to use the Program under this License. 
However, parties who have received copies, or rights to use copies, from you under this General Public License will not have their licenses terminated so long as such parties remain in full compliance. 5. By copying, distributing or modifying the Program (or any work based on the Program) you indicate your acceptance of this license to do so, and all its terms and conditions. 6. Each time you redistribute the Program (or any work based on the Program), the recipient automatically receives a license from the original licensor to copy, distribute or modify the Program subject to these terms and conditions. You may not impose any further restrictions on the recipients' exercise of the rights granted herein. 7. The Free Software Foundation may publish revised and/or new versions of the General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. Each version is given a distinguishing version number. If the Program specifies a version number of the license which applies to it and "any later version", you have the option of following the terms and conditions either of that version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of the license, you may choose any version ever published by the Free Software Foundation. 8. If you wish to incorporate parts of the Program into other free programs whose distribution conditions are different, write to the author to ask for permission. For software which is copyrighted by the Free Software Foundation, write to the Free Software Foundation; we sometimes make exceptions for this. Our decision will be guided by the two goals of preserving the free status of all derivatives of our free software and of promoting the sharing and reuse of software generally. NO WARRANTY 9. 
BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 10. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. END OF TERMS AND CONDITIONS Appendix: How to Apply These Terms to Your New Programs If you develop a new program, and you want it to be of the greatest possible use to humanity, the best way to achieve this is to make it free software which everyone can redistribute and change under these terms. To do so, attach the following notices to the program. It is safest to attach them to the start of each source file to most effectively convey the exclusion of warranty; and each file should have at least the "copyright" line and a pointer to where the full notice is found. 

    <one line to give the program's name and a brief idea of what it does.>
    Copyright (C) 19yy  <name of author>

    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation; either version 1, or (at your option)
    any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program; if not, write to the Free Software
    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston MA
    02110-1301 USA

Also add information on how to contact you by electronic and paper mail.

If the program is interactive, make it output a short notice like this
when it starts in an interactive mode:

    Gnomovision version 69, Copyright (C) 19xx name of author
    Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type 'show w'.
    This is free software, and you are welcome to redistribute it
    under certain conditions; type 'show c' for details.

The hypothetical commands 'show w' and 'show c' should show the
appropriate parts of the General Public License.  Of course, the
commands you use may be called something other than 'show w' and 'show
c'; they could even be mouse-clicks or menu items--whatever suits your
program.

You should also get your employer (if you work as a programmer) or your
school, if any, to sign a "copyright disclaimer" for the program, if
necessary.  Here a sample; alter the names:

    Yoyodyne, Inc., hereby disclaims all copyright interest in the
    program 'Gnomovision' (a program to direct compilers to make passes
    at assemblers) written by James Hacker.

    <signature of Ty Coon>, 1 April 1989
    Ty Coon, President of Vice

That's all there is to it!

=cut

=encoding utf8

=head1 NAME

perlebcdic - Considerations for running Perl on EBCDIC platforms

=head1 DESCRIPTION

An exploration of some of the issues facing Perl programmers
on EBCDIC based computers.  We do not cover localization,
internationalization, or multi-byte character set issues other
than some discussion of UTF-8 and UTF-EBCDIC.

Portions that are still incomplete are marked with XXX.

Perl used to work on EBCDIC machines, but there are now areas of the code where
it doesn't.  If you want to use Perl on an EBCDIC machine, please let us know
by sending mail to perlbug@perl.org

=head1 COMMON CHARACTER CODE SETS

=head2 ASCII

The American Standard Code for Information Interchange (ASCII or US-ASCII) is a
set of integers running from 0 to 127 (decimal) that imply character
interpretation by the display and other systems of computers.
The range 0..127 can be covered by setting the bits in a 7-bit binary
number, hence the set is sometimes referred to as "7-bit ASCII".
ASCII was described by the American National Standards Institute
document ANSI X3.4-1986.  It was also described by ISO 646:1991
(with localization for currency symbols).  The full ASCII set is
given in the table below as the first 128 elements.  Languages that
can be written adequately with the characters in ASCII include
English, Hawaiian, Indonesian, Swahili and some Native American
languages.

There are many character sets that extend the range of integers
from 0..2**7-1 up to 2**8-1, or 8 bit bytes (octets if you prefer).
One common one is the ISO 8859-1 character set.

=head2 ISO 8859

The ISO 8859-$n are a collection of character code sets from the
International Organization for Standardization (ISO), each of which
adds characters to the ASCII set that are typically found in European
languages, many of which are based on the Roman, or Latin, alphabet.

=head2 Latin 1 (ISO 8859-1)

A particular 8-bit extension to ASCII that includes grave and acute
accented Latin characters.  Languages that can employ ISO 8859-1
include all the languages covered by ASCII as well as Afrikaans,
Albanian, Basque, Catalan, Danish, Faroese, Finnish, Norwegian,
Portuguese, Spanish, and Swedish.  Dutch is covered albeit without
the ij ligature.  French is covered too but without the oe ligature.
German can use ISO 8859-1 but must do so without German-style
quotation marks.  This set is based on Western European extensions
to ASCII and is commonly encountered in world wide web work.
In IBM character code set identification terminology ISO 8859-1 is
also known as CCSID 819 (or sometimes 0819 or even 00819).

=head2 EBCDIC

The Extended Binary Coded Decimal Interchange Code refers to a
large collection of single- and multi-byte coded character sets that are
different from ASCII or ISO 8859-1 and are all slightly different from each
other; they typically run on host computers.  The EBCDIC encodings derive from
8-bit byte extensions of Hollerith punched card encodings.  The layout on the
cards was such that high bits were set for the upper and lower case alphabet
characters [a-z] and [A-Z], but there were gaps within each Latin alphabet
range.

Some IBM EBCDIC character sets may be known by character code set
identification numbers (CCSID numbers) or code page numbers.

Perl can be compiled on platforms that run any of three commonly used EBCDIC
character sets, listed below.

=head2 The 13 variant characters

Among IBM EBCDIC character code sets there are 13 characters that
are often mapped to different integer values.  Those characters
are known as the 13 "variant" characters and are:

    \ [ ] { } ^ ~ ! # | $ @ `

When Perl is compiled for a platform, it looks at some of these characters to
guess which EBCDIC character set the platform uses, and adapts itself
accordingly to that platform.
If the platform uses a character set that is not
one of the three Perl knows about, Perl will either fail to compile, or
mistakenly and silently choose one of the three.
The three are:

=head2 0037

Character code set ID 0037 is a mapping of the ASCII plus Latin-1
characters (i.e. ISO 8859-1) to an EBCDIC set.  0037 is used
in North American English locales on the OS/400 operating system
that runs on AS/400 computers.  CCSID 0037 differs from ISO 8859-1
in 237 places; in other words, they agree on only 19 code point values.

=head2 1047

Character code set ID 1047 is also a mapping of the ASCII plus
Latin-1 characters (i.e. ISO 8859-1) to an EBCDIC set.  1047 is
used under Unix System Services for OS/390 or z/OS, and OpenEdition
for VM/ESA.  CCSID 1047 differs from CCSID 0037 in eight places.

=head2 POSIX-BC

The EBCDIC code page in use on Siemens' BS2000 system is distinct from
1047 and 0037.  It is identified below as the POSIX-BC set.

=head2 Unicode code points versus EBCDIC code points

In Unicode terminology a I<code point> is the number assigned to a
character: for example, in EBCDIC the character "A" is usually assigned
the number 193.  In Unicode the character "A" is assigned the number 65.
This causes a problem with the semantics of the pack/unpack "U", which
are supposed to pack Unicode code points to characters and back to numbers.
The problem is: which code points to use for code points less than 256?
(For 256 and over there's no problem: Unicode code points are used.)
In EBCDIC, for the low 256 the EBCDIC code points are used.  This
means that the equivalences

    pack("U", ord($character)) eq $character
    unpack("U", $character) == ord $character

will hold.  (If Unicode code points were applied consistently over
all the possible code points, pack("U",ord("A")) would in EBCDIC
equal I<A with acute> or chr(101), and unpack("U", "A") would equal
65, or I<non-breaking space>, not 193, or ord "A".)
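
The two equivalences can be checked with a few lines of Perl.  This is a
minimal sketch rather than a test from the Perl distribution; the variable
names are chosen for the example, and the value printed for ord depends on
the platform's native character set.

```perl
use strict;
use warnings;

my $character = "A";

# Number -> character: packing the native ordinal gives back "A".
my $pack_ok = pack("U", ord($character)) eq $character;

# Character -> number: unpacking "A" gives back the native ordinal.
my $unpack_ok = unpack("U", $character) == ord $character;

# 65 on ASCII platforms, 193 on most EBCDIC platforms.
print "ord of A: ", ord($character), "\n";
print "pack round trip holds\n"   if $pack_ok;
print "unpack round trip holds\n" if $unpack_ok;
```

Because both sides of each comparison use the platform's own code point
for "A", the same script is expected to report success on ASCII and on
EBCDIC systems alike.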
=head2 Remaining Perl Unicode problems in EBCDIC

=over 4

=item *

Many of the remaining problems seem to be related to case-insensitive matching.

=item *

The extensions Unicode::Collate and Unicode::Normalize are not
supported under EBCDIC; likewise for the encoding pragma.

=back

=head2 Unicode and UTF

UTF stands for C<Unicode Transformation Format>.
UTF-8 is an encoding of Unicode into a sequence of 8-bit byte chunks, based on
ASCII and Latin-1.
The length of a sequence required to represent a Unicode code point
depends on the ordinal number of that code point,
with larger numbers requiring more bytes.
UTF-EBCDIC is like UTF-8, but based on EBCDIC.

You may see the term C<invariant> character or code point.
This simply means that the character has the same numeric
value when encoded as when not.
(Note that this is a very different concept from
L</The 13 variant characters> mentioned above.)
For example, the ordinal value of 'A' is 193 in most EBCDIC code pages,
and also is 193 when encoded in UTF-EBCDIC.
All variant code points occupy at least two bytes when encoded.
In UTF-8, the code points corresponding to the lowest 128
ordinal numbers (0 - 127: the ASCII characters) are invariant.
In UTF-EBCDIC, there are 160 invariant characters.
(If you care, the EBCDIC invariants are those characters
which have ASCII equivalents, plus those that correspond to
the C1 controls (80..9f on ASCII platforms).)

A string encoded in UTF-EBCDIC may be longer (but never shorter) than
one encoded in UTF-8.

=head2 Using Encode

Starting from Perl 5.8 you can use the standard Encode module
to translate from EBCDIC to Latin-1 code points.
Encode knows about more EBCDIC character sets than Perl can currently
be compiled to run on.
    use Encode 'from_to';

    my %ebcdic = ( 176 => 'cp37', 95 => 'cp1047', 106 => 'posix-bc' );

    # $a is in EBCDIC code points
    from_to($a, $ebcdic{ord '^'}, 'latin1');
    # $a is ISO 8859-1 code points

and from Latin-1 code points to EBCDIC code points

    use Encode 'from_to';

    my %ebcdic = ( 176 => 'cp37', 95 => 'cp1047', 106 => 'posix-bc' );

    # $a is ISO 8859-1 code points
    from_to($a, 'latin1', $ebcdic{ord '^'});
    # $a is in EBCDIC code points

For doing I/O it is suggested that you use the autotranslating features
of PerlIO, see L<perluniintro>.

Since version 5.8 Perl uses the new PerlIO I/O library.  This enables
you to use different encodings per IO channel.  For example you may use

    use Encode;
    open($f, ">:encoding(ascii)", "test.ascii");
    print $f "Hello World!\n";
    open($f, ">:encoding(cp37)", "test.ebcdic");
    print $f "Hello World!\n";
    open($f, ">:encoding(latin1)", "test.latin1");
    print $f "Hello World!\n";
    open($f, ">:encoding(utf8)", "test.utf8");
    print $f "Hello World!\n";

to get four files containing "Hello World!\n" in ASCII, CP 0037 EBCDIC,
ISO 8859-1 (Latin-1) (in this example identical to ASCII since only ASCII
characters were printed), and
UTF-EBCDIC (in this example identical to normal EBCDIC since only characters
that don't differ between EBCDIC and UTF-EBCDIC were printed).  See the
documentation of Encode::PerlIO for details.

As the PerlIO layer uses raw IO (bytes) internally, all this totally
ignores things like the type of your filesystem (ASCII or EBCDIC).

=head1 SINGLE OCTET TABLES

The following tables list the ASCII and Latin 1 ordered sets including
the subsets: C0 controls (0..31), ASCII graphics (32..126), delete (127),
C1 controls (128..159), and Latin-1 (a.k.a. ISO 8859-1) (160..255).
In the table non-printing control character names as well as the Latin 1 extensions to ASCII have been labelled with character names roughly corresponding to I albeit with substitutions such as s/LATIN// and s/VULGAR// in all cases, s/CAPITAL LETTER// in some cases, and s/SMALL LETTER ([A-Z])/\l$1/ in some other cases. The "names" of the controls listed here are the Unicode Version 1 names, except for the few that don't have names, in which case the names in the Wikipedia article were used (L). The differences between the 0037 and 1047 sets are flagged with ***. The differences between the 1047 and POSIX-BC sets are flagged with ###. All ord() numbers listed are decimal. If you would rather see this table listing octal values then run the table (that is, the pod version of this document since this recipe may not work with a pod2_other_format translation) through: =over 4 =item recipe 0 =back perl -ne 'if(/(.{43})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)' \ -e '{printf("%s%-9.03o%-9.03o%-9.03o%.03o\n",$1,$2,$3,$4,$5)}' \ perlebcdic.pod If you want to retain the UTF-x code points then in script form you might want to write: =over 4 =item recipe 1 =back open(FH,") { if (/(.{43})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\.?(\d*)\s+(\d+)\.?(\d*)/) { if ($7 ne '' && $9 ne '') { printf( "%s%-9.03o%-9.03o%-9.03o%-9.03o%-3o.%-5o%-3o.%.03o\n", $1,$2,$3,$4,$5,$6,$7,$8,$9); } elsif ($7 ne '') { printf("%s%-9.03o%-9.03o%-9.03o%-9.03o%-3o.%-5o%.03o\n", $1,$2,$3,$4,$5,$6,$7,$8); } else { printf("%s%-9.03o%-9.03o%-9.03o%-9.03o%-9.03o%.03o\n", $1,$2,$3,$4,$5,$6,$8); } } } If you would rather see this table listing hexadecimal values then run the table through: =over 4 =item recipe 2 =back perl -ne 'if(/(.{43})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)' \ -e '{printf("%s%-9.02X%-9.02X%-9.02X%.02X\n",$1,$2,$3,$4,$5)}' \ perlebcdic.pod Or, in order to retain the UTF-x code points in hexadecimal: =over 4 =item recipe 3 =back open(FH,") { if 
(/(.{43})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\.?(\d*)\s+(\d+)\.?(\d*)/) { if ($7 ne '' && $9 ne '') { printf( "%s%-9.02X%-9.02X%-9.02X%-9.02X%-2X.%-6.02X%02X.%02X\n", $1,$2,$3,$4,$5,$6,$7,$8,$9); } elsif ($7 ne '') { printf("%s%-9.02X%-9.02X%-9.02X%-9.02X%-2X.%-6.02X%02X\n", $1,$2,$3,$4,$5,$6,$7,$8); } else { printf("%s%-9.02X%-9.02X%-9.02X%-9.02X%-9.02X%02X\n", $1,$2,$3,$4,$5,$6,$8); } } } ISO 8859-1 CCSID CCSID CCSID 1047 chr CCSID 0819 0037 1047 POSIX-BC UTF-8 UTF-EBCDIC ---------------------------------------------------------------------------------------------- 0 0 0 0 0 0 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 55 55 55 4 55 5 45 45 45 5 45 6 46 46 46 6 46 7 47 47 47 7 47 8 22 22 22 8 22 9 5 5 5 9 5 10 37 21 21 10 21 *** 11 11 11 11 11 11

12 12 12 12 12 12 13 13 13 13 13 13 14 14 14 14 14 14 15 15 15 15 15 15 16 16 16 16 16 16 17 17 17 17 17 17 18 18 18 18 18 18 19 19 19 19 19 19 20 60 60 60 20 60 21 61 61 61 21 61 22 50 50 50 22 50 23 38 38 38 23 38 24 24 24 24 24 24 25 25 25 25 25 25 26 63 63 63 26 63 27 39 39 39 27 39 28 28 28 28 28 28 29 29 29 29 29 29 30 30 30 30 30 30 31 31 31 31 31 31 32 64 64 64 32 64 ! 33 90 90 90 33 90 " 34 127 127 127 34 127 # 35 123 123 123 35 123 $ 36 91 91 91 36 91 % 37 108 108 108 37 108 & 38 80 80 80 38 80 ' 39 125 125 125 39 125 ( 40 77 77 77 40 77 ) 41 93 93 93 41 93 * 42 92 92 92 42 92 + 43 78 78 78 43 78 , 44 107 107 107 44 107 - 45 96 96 96 45 96 . 46 75 75 75 46 75 / 47 97 97 97 47 97 0 48 240 240 240 48 240 1 49 241 241 241 49 241 2 50 242 242 242 50 242 3 51 243 243 243 51 243 4 52 244 244 244 52 244 5 53 245 245 245 53 245 6 54 246 246 246 54 246 7 55 247 247 247 55 247 8 56 248 248 248 56 248 9 57 249 249 249 57 249 : 58 122 122 122 58 122 ; 59 94 94 94 59 94 < 60 76 76 76 60 76 = 61 126 126 126 61 126 > 62 110 110 110 62 110 ? 
63 111 111 111 63 111 @ 64 124 124 124 64 124 A 65 193 193 193 65 193 B 66 194 194 194 66 194 C 67 195 195 195 67 195 D 68 196 196 196 68 196 E 69 197 197 197 69 197 F 70 198 198 198 70 198 G 71 199 199 199 71 199 H 72 200 200 200 72 200 I 73 201 201 201 73 201 J 74 209 209 209 74 209 K 75 210 210 210 75 210 L 76 211 211 211 76 211 M 77 212 212 212 77 212 N 78 213 213 213 78 213 O 79 214 214 214 79 214 P 80 215 215 215 80 215 Q 81 216 216 216 81 216 R 82 217 217 217 82 217 S 83 226 226 226 83 226 T 84 227 227 227 84 227 U 85 228 228 228 85 228 V 86 229 229 229 86 229 W 87 230 230 230 87 230 X 88 231 231 231 88 231 Y 89 232 232 232 89 232 Z 90 233 233 233 90 233 [ 91 186 173 187 91 173 *** ### \ 92 224 224 188 92 224 ### ] 93 187 189 189 93 189 *** ^ 94 176 95 106 94 95 *** ### _ 95 109 109 109 95 109 ` 96 121 121 74 96 121 ### a 97 129 129 129 97 129 b 98 130 130 130 98 130 c 99 131 131 131 99 131 d 100 132 132 132 100 132 e 101 133 133 133 101 133 f 102 134 134 134 102 134 g 103 135 135 135 103 135 h 104 136 136 136 104 136 i 105 137 137 137 105 137 j 106 145 145 145 106 145 k 107 146 146 146 107 146 l 108 147 147 147 108 147 m 109 148 148 148 109 148 n 110 149 149 149 110 149 o 111 150 150 150 111 150 p 112 151 151 151 112 151 q 113 152 152 152 113 152 r 114 153 153 153 114 153 s 115 162 162 162 115 162 t 116 163 163 163 116 163 u 117 164 164 164 117 164 v 118 165 165 165 118 165 w 119 166 166 166 119 166 x 120 167 167 167 120 167 y 121 168 168 168 121 168 z 122 169 169 169 122 169 { 123 192 192 251 123 192 ### | 124 79 79 79 124 79 } 125 208 208 253 125 208 ### ~ 126 161 161 255 126 161 ### 127 7 7 7 127 7 128 32 32 32 194.128 32 129 33 33 33 194.129 33 130 34 34 34 194.130 34 131 35 35 35 194.131 35 132 36 36 36 194.132 36 133 21 37 37 194.133 37 *** 134 6 6 6 194.134 6 135 23 23 23 194.135 23 136 40 40 40 194.136 40 137 41 41 41 194.137 41 138 42 42 42 194.138 42 139 43 43 43 194.139 43 140 44 44 44 194.140 44 141 9 9 9 194.141 9 142 10 10 10 194.142 10 143 27 
27 27 194.143 27 144 48 48 48 194.144 48 145 49 49 49 194.145 49 146 26 26 26 194.146 26 147 51 51 51 194.147 51 148 52 52 52 194.148 52 149 53 53 53 194.149 53 150 54 54 54 194.150 54 151 8 8 8 194.151 8 152 56 56 56 194.152 56 153 57 57 57 194.153 57 154 58 58 58 194.154 58 155 59 59 59 194.155 59 156 4 4 4 194.156 4 157 20 20 20 194.157 20 158 62 62 62 194.158 62 159 255 255 95 194.159 255 ### 160 65 65 65 194.160 128.65 161 170 170 170 194.161 128.66 162 74 74 176 194.162 128.67 ### 163 177 177 177 194.163 128.68 164 159 159 159 194.164 128.69 165 178 178 178 194.165 128.70 166 106 106 208 194.166 128.71 ###
167 181 181 181 194.167 128.72 168 189 187 121 194.168 128.73 *** ### 169 180 180 180 194.169 128.74 170 154 154 154 194.170 128.81 171 138 138 138 194.171 128.82 172 95 176 186 194.172 128.83 *** ### 173 202 202 202 194.173 128.84 174 175 175 175 194.174 128.85 175 188 188 161 194.175 128.86 ### 176 144 144 144 194.176 128.87 177 143 143 143 194.177 128.88 178 234 234 234 194.178 128.89 179 250 250 250 194.179 128.98 180 190 190 190 194.180 128.99 181 160 160 160 194.181 128.100 182 182 182 182 194.182 128.101 183 179 179 179 194.183 128.102 184 157 157 157 194.184 128.103 185 218 218 218 194.185 128.104 186 155 155 155 194.186 128.105 187 139 139 139 194.187 128.106 188 183 183 183 194.188 128.112 189 184 184 184 194.189 128.113 190 185 185 185 194.190 128.114 191 171 171 171 194.191 128.115 192 100 100 100 195.128 138.65 193 101 101 101 195.129 138.66 194 98 98 98 195.130 138.67 195 102 102 102 195.131 138.68 196 99 99 99 195.132 138.69 197 103 103 103 195.133 138.70 198 158 158 158 195.134 138.71 199 104 104 104 195.135 138.72 200 116 116 116 195.136 138.73 201 113 113 113 195.137 138.74 202 114 114 114 195.138 138.81 203 115 115 115 195.139 138.82 204 120 120 120 195.140 138.83 205 117 117 117 195.141 138.84 206 118 118 118 195.142 138.85 207 119 119 119 195.143 138.86 208 172 172 172 195.144 138.87 209 105 105 105 195.145 138.88 210 237 237 237 195.146 138.89 211 238 238 238 195.147 138.98 212 235 235 235 195.148 138.99 213 239 239 239 195.149 138.100 214 236 236 236 195.150 138.101 215 191 191 191 195.151 138.102 216 128 128 128 195.152 138.103 217 253 253 224 195.153 138.104 ### 218 254 254 254 195.154 138.105 219 251 251 221 195.155 138.106 ### 220 252 252 252 195.156 138.112 221 173 186 173 195.157 138.113 *** ### 222 174 174 174 195.158 138.114 223 89 89 89 195.159 138.115 224 68 68 68 195.160 139.65 225 69 69 69 195.161 139.66 226 66 66 66 195.162 139.67 227 70 70 70 195.163 139.68 228 67 67 67 195.164 139.69 229 71 71 71 195.165 139.70 230 156 156 156 
195.166 139.71 231 72 72 72 195.167 139.72 232 84 84 84 195.168 139.73 233 81 81 81 195.169 139.74 234 82 82 82 195.170 139.81 235 83 83 83 195.171 139.82 236 88 88 88 195.172 139.83 237 85 85 85 195.173 139.84 238 86 86 86 195.174 139.85 239 87 87 87 195.175 139.86 240 140 140 140 195.176 139.87 241 73 73 73 195.177 139.88 242 205 205 205 195.178 139.89 243 206 206 206 195.179 139.98 244 203 203 203 195.180 139.99 245 207 207 207 195.181 139.100 246 204 204 204 195.182 139.101 247 225 225 225 195.183 139.102 248 112 112 112 195.184 139.103 249 221 221 192 195.185 139.104 ### 250 222 222 222 195.186 139.105 251 219 219 219 195.187 139.106 252 220 220 220 195.188 139.112 253 141 141 141 195.189 139.113 254 142 142 142 195.190 139.114 255 223 223 223 195.191 139.115 If you would rather see the above table in CCSID 0037 order rather than ASCII + Latin-1 order then run the table through: =over 4 =item recipe 4 =back perl \ -ne 'if(/.{43}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}/)'\ -e '{push(@l,$_)}' \ -e 'END{print map{$_->[0]}' \ -e ' sort{$a->[1] <=> $b->[1]}' \ -e ' map{[$_,substr($_,52,3)]}@l;}' perlebcdic.pod If you would rather see it in CCSID 1047 order then change the number 52 in the last line to 61, like this: =over 4 =item recipe 5 =back perl \ -ne 'if(/.{43}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}/)'\ -e '{push(@l,$_)}' \ -e 'END{print map{$_->[0]}' \ -e ' sort{$a->[1] <=> $b->[1]}' \ -e ' map{[$_,substr($_,61,3)]}@l;}' perlebcdic.pod If you would rather see it in POSIX-BC order then change the number 61 in the last line to 70, like this: =over 4 =item recipe 6 =back perl \ -ne 'if(/.{43}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}/)'\ -e '{push(@l,$_)}' \ -e 'END{print map{$_->[0]}' \ -e ' sort{$a->[1] <=> $b->[1]}' \ -e ' map{[$_,substr($_,70,3)]}@l;}' perlebcdic.pod =head1 IDENTIFYING CHARACTER CODE SETS To determine the character set you are running under from perl one could use the return value of ord() or chr() to test one or 
more character values. For example: $is_ascii = "A" eq chr(65); $is_ebcdic = "A" eq chr(193); Also, "\t" is a C character so that: $is_ascii = ord("\t") == 9; $is_ebcdic = ord("\t") == 5; To distinguish EBCDIC code pages try looking at one or more of the characters that differ between them. For example: $is_ebcdic_37 = "\n" eq chr(37); $is_ebcdic_1047 = "\n" eq chr(21); Or better still choose a character that is uniquely encoded in any of the code sets, e.g.: $is_ascii = ord('[') == 91; $is_ebcdic_37 = ord('[') == 186; $is_ebcdic_1047 = ord('[') == 173; $is_ebcdic_POSIX_BC = ord('[') == 187; However, it would be unwise to write tests such as: $is_ascii = "\r" ne chr(13); # WRONG $is_ascii = "\n" ne chr(10); # ILL ADVISED Obviously the first of these will fail to distinguish most ASCII platforms from either a CCSID 0037, a 1047, or a POSIX-BC EBCDIC platform since "\r" eq chr(13) under all of those coded character sets. But note too that because "\n" is chr(13) and "\r" is chr(10) on the Macintosh (which is an ASCII platform) the second C<$is_ascii> test will lead to trouble there. To determine whether or not perl was built under an EBCDIC code page you can use the Config module like so: use Config; $is_ebcdic = $Config{'ebcdic'} eq 'define'; =head1 CONVERSIONS =head2 tr/// In order to convert a string of characters from one character set to another a simple list of numbers, such as in the right columns in the above table, along with perl's tr/// operator is all that is needed. The data in the table are in ASCII/Latin1 order, hence the EBCDIC columns provide easy-to-use ASCII/Latin1 to EBCDIC operations that are also easily reversed. For example, to convert ASCII/Latin1 to code page 037 take the output of the second numbers column from the output of recipe 2 (modified to add '\' characters) and use it in tr/// like so: $cp_037 = '\x00\x01\x02\x03\x37\x2D\x2E\x2F\x16\x05\x25\x0B\x0C\x0D\x0E\x0F' . '\x10\x11\x12\x13\x3C\x3D\x32\x26\x18\x19\x3F\x27\x1C\x1D\x1E\x1F' . 
'\x40\x5A\x7F\x7B\x5B\x6C\x50\x7D\x4D\x5D\x5C\x4E\x6B\x60\x4B\x61' . '\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF7\xF8\xF9\x7A\x5E\x4C\x7E\x6E\x6F' . '\x7C\xC1\xC2\xC3\xC4\xC5\xC6\xC7\xC8\xC9\xD1\xD2\xD3\xD4\xD5\xD6' . '\xD7\xD8\xD9\xE2\xE3\xE4\xE5\xE6\xE7\xE8\xE9\xBA\xE0\xBB\xB0\x6D' . '\x79\x81\x82\x83\x84\x85\x86\x87\x88\x89\x91\x92\x93\x94\x95\x96' . '\x97\x98\x99\xA2\xA3\xA4\xA5\xA6\xA7\xA8\xA9\xC0\x4F\xD0\xA1\x07' . '\x20\x21\x22\x23\x24\x15\x06\x17\x28\x29\x2A\x2B\x2C\x09\x0A\x1B' . '\x30\x31\x1A\x33\x34\x35\x36\x08\x38\x39\x3A\x3B\x04\x14\x3E\xFF' . '\x41\xAA\x4A\xB1\x9F\xB2\x6A\xB5\xBD\xB4\x9A\x8A\x5F\xCA\xAF\xBC' . '\x90\x8F\xEA\xFA\xBE\xA0\xB6\xB3\x9D\xDA\x9B\x8B\xB7\xB8\xB9\xAB' . '\x64\x65\x62\x66\x63\x67\x9E\x68\x74\x71\x72\x73\x78\x75\x76\x77' . '\xAC\x69\xED\xEE\xEB\xEF\xEC\xBF\x80\xFD\xFE\xFB\xFC\xAD\xAE\x59' . '\x44\x45\x42\x46\x43\x47\x9C\x48\x54\x51\x52\x53\x58\x55\x56\x57' . '\x8C\x49\xCD\xCE\xCB\xCF\xCC\xE1\x70\xDD\xDE\xDB\xDC\x8D\x8E\xDF'; my $ebcdic_string = $ascii_string; eval '$ebcdic_string =~ tr/\000-\377/' . $cp_037 . '/'; To convert from EBCDIC 037 to ASCII just reverse the order of the tr/// arguments like so: my $ascii_string = $ebcdic_string; eval '$ascii_string =~ tr/' . $cp_037 . '/\000-\377/'; Similarly one could take the output of the third numbers column from recipe 2 to obtain a C<$cp_1047> table. The fourth numbers column of the output from recipe 2 could provide a C<$cp_posix_bc> table suitable for transcoding as well. If you wanted to see the inverse tables, you would first have to sort on the desired numbers column as in recipes 4, 5 or 6, then take the output of the first numbers column. =head2 iconv XPG operability often implies the presence of an I utility available from the shell or from the C library. Consult your system's documentation for information on iconv. On OS/390 or z/OS see the iconv(1) manpage. 
One way to invoke the iconv shell utility from within perl would be to: # OS/390 or z/OS example $ascii_data = `echo '$ebcdic_data'| iconv -f IBM-1047 -t ISO8859-1` or the inverse map: # OS/390 or z/OS example $ebcdic_data = `echo '$ascii_data'| iconv -f ISO8859-1 -t IBM-1047` For other perl-based conversion options see the Convert::* modules on CPAN. =head2 C RTL The OS/390 and z/OS C run-time libraries provide _atoe() and _etoa() functions. =head1 OPERATOR DIFFERENCES The C<..> range operator treats certain character ranges with care on EBCDIC platforms. For example the following array will have twenty six elements on either an EBCDIC platform or an ASCII platform: @alphabet = ('A'..'Z'); # $#alphabet == 25 The bitwise operators such as & ^ | may return different results when operating on string or character data in a perl program running on an EBCDIC platform than when run on an ASCII platform. Here is an example adapted from the one in L: # EBCDIC-based examples print "j p \n" ^ " a h"; # prints "JAPH\n" print "JA" | " ph\n"; # prints "japh\n" print "JAPH\nJunk" & "\277\277\277\277\277"; # prints "japh\n"; print 'p N$' ^ " E eq C<\c@>)> C<(chr(1)> eq C<\cA>)>, and so on. Perl on EBCDIC platforms has been ported to take C<\c@> to chr(0) and C<\cA> to chr(1), etc. as well, but the thirty three characters that result depend on which code page you are using. The table below uses the standard acronyms for the controls. The POSIX-BC and 1047 sets are identical throughout this range and differ from the 0037 set at only one spot (21 decimal). Note that the C character may be generated by C<\cJ> on ASCII platforms but by C<\cU> on 1047 or POSIX-BC platforms and cannot be generated as a C<"\c.letter."> control character on 0037 platforms. Note also that C<\c\> cannot be the final element in a string or regex, as it will absorb the terminator. But C<\c\I> is a C concatenated with I for all I. 
chr ord 8859-1 0037 1047 && POSIX-BC ----------------------------------------------------------------------- \c? 127 " " \c@ 0 \cA 1 \cB 2 \cC 3 \cD 4 \cE 5 \cF 6 \cG 7 \cH 8 \cI 9 \cJ 10 \cK 11 \cL 12 \cM 13 \cN 14 \cO 15 \cP 16 \cQ 17 \cR 18 \cS 19 \cT 20 \cU 21 *** \cV 22 \cW 23 \cX 24 \cY 25 \cZ 26 \c[ 27 \c\X 28 X X X \c] 29 \c^ 30 \c_ 31 =head1 FUNCTION DIFFERENCES =over 8 =item chr() chr() must be given an EBCDIC code number argument to yield a desired character return value on an EBCDIC platform. For example: $CAPITAL_LETTER_A = chr(193); =item ord() ord() will return EBCDIC code number values on an EBCDIC platform. For example: $the_number_193 = ord("A"); =item pack() The c and C templates for pack() are dependent upon character set encoding. Examples of usage on EBCDIC include: $foo = pack("CCCC",193,194,195,196); # $foo eq "ABCD" $foo = pack("C4",193,194,195,196); # same thing $foo = pack("ccxxcc",193,194,195,196); # $foo eq "AB\0\0CD" =item print() One must be careful with scalars and strings that are passed to print that contain ASCII encodings. One common place for this to occur is in the output of the MIME type header for CGI script writing. For example, many perl programming guides recommend something similar to: print "Content-type:\ttext/html\015\012\015\012"; # this may be wrong on EBCDIC Under the IBM OS/390 USS Web Server or WebSphere on z/OS for example you should instead write that as: print "Content-type:\ttext/html\r\n\r\n"; # OK for DGW et al That is because the translation from EBCDIC to ASCII is done by the web server in this case (such code will not be appropriate for the Macintosh however). Consult your web server's documentation for further details. =item printf() The formats that can convert characters to numbers and vice versa will be different from their ASCII counterparts when executed on an EBCDIC platform. 
Examples include:

    printf("%c%c%c",193,194,195);  # prints ABC

=item sort()

EBCDIC sort results may differ from ASCII sort results especially for
mixed case strings.  This is discussed in more detail below.

=item sprintf()

See the discussion of printf() above.  An example of the use
of sprintf would be:

    $CAPITAL_LETTER_A = sprintf("%c",193);

=item unpack()

See the discussion of pack() above.

=back

=head1 REGULAR EXPRESSION DIFFERENCES

As of perl 5.005_03 the letter range regular expressions such as
[A-Z] and [a-z] have been especially coded to not pick up gap
characters.  For example, characters such as E<ocirc> C<o WITH CIRCUMFLEX>
that lie between I and J would not be matched by the
regular expression range C</[H-K]/>.  This works in
the other direction, too, if either of the range end points is
explicitly numeric: C<[\x89-\x91]> will match C<\x8e>, even
though C<\x89> is C<i> and C<\x91> is C<j>, and C<\x8e>
is a gap character from the alphabetic viewpoint.

If you do want to match the alphabet gap characters in a single octet
regular expression try matching the hex or octal code such
as C</\313/> on EBCDIC or C</\364/> on ASCII platforms to
have your regular expression match C<o WITH CIRCUMFLEX>.

Another construct to be wary of is the inappropriate use of hex or
octal constants in regular expressions.  Consider the following
set of subs:

    sub is_c0 {
        my $char = substr(shift,0,1);
        $char =~ /[\000-\037]/;
    }

    sub is_print_ascii {
        my $char = substr(shift,0,1);
        $char =~ /[\040-\176]/;
    }

    sub is_delete {
        my $char = substr(shift,0,1);
        $char eq "\177";
    }

    sub is_c1 {
        my $char = substr(shift,0,1);
        $char =~ /[\200-\237]/;
    }

    sub is_latin_1 {
        my $char = substr(shift,0,1);
        $char =~ /[\240-\377]/;
    }

The above would be adequate if the concern was only with numeric code points.
However, the concern may be with characters rather than code points
and on an EBCDIC platform it may be desirable for constructs such as
C<if (is_print_ascii("A")) {print "A is a printable character\n";}> to print
out the expected message.
One way to represent the above collection of character classification subs that is capable of working across the four coded character sets discussed in this document is as follows: sub Is_c0 { my $char = substr(shift,0,1); if (ord('^')==94) { # ascii return $char =~ /[\000-\037]/; } if (ord('^')==176) { # 0037 return $char =~ /[\000-\003\067\055-\057\026\005\045\013-\023\074\075\062\046\030\031\077\047\034-\037]/; } if (ord('^')==95 || ord('^')==106) { # 1047 || posix-bc return $char =~ /[\000-\003\067\055-\057\026\005\025\013-\023\074\075\062\046\030\031\077\047\034-\037]/; } } sub Is_print_ascii { my $char = substr(shift,0,1); $char =~ /[ !"\#\$%&'()*+,\-.\/0-9:;<=>?\@A-Z[\\\]^_`a-z{|}~]/; } sub Is_delete { my $char = substr(shift,0,1); if (ord('^')==94) { # ascii return $char eq "\177"; } else { # ebcdic return $char eq "\007"; } } sub Is_c1 { my $char = substr(shift,0,1); if (ord('^')==94) { # ascii return $char =~ /[\200-\237]/; } if (ord('^')==176) { # 0037 return $char =~ /[\040-\044\025\006\027\050-\054\011\012\033\060\061\032\063-\066\010\070-\073\040\024\076\377]/; } if (ord('^')==95) { # 1047 return $char =~ /[\040-\045\006\027\050-\054\011\012\033\060\061\032\063-\066\010\070-\073\040\024\076\377]/; } if (ord('^')==106) { # posix-bc return $char =~ /[\040-\045\006\027\050-\054\011\012\033\060\061\032\063-\066\010\070-\073\040\024\076\137]/; } } sub Is_latin_1 { my $char = substr(shift,0,1); if (ord('^')==94) { # ascii return $char =~ /[\240-\377]/; } if (ord('^')==176) { # 0037 return $char =~ /[\101\252\112\261\237\262\152\265\275\264\232\212\137\312\257\274\220\217\352\372\276\240\266\263\235\332\233\213\267\270\271\253\144\145\142\146\143\147\236\150\164\161-\163\170\165-\167\254\151\355\356\353\357\354\277\200\375\376\373\374\255\256\131\104\105\102\106\103\107\234\110\124\121-\123\130\125-\127\214\111\315\316\313\317\314\341\160\335\336\333\334\215\216\337]/; } if (ord('^')==95) { # 1047 return $char =~ 
/[\101\252\112\261\237\262\152\265\273\264\232\212\260\312\257\274\220\217\352\372\276\240\266\263\235\332\233\213\267\270\271\253\144\145\142\146\143\147\236\150\164\161-\163\170\165-\167\254\151\355\356\353\357\354\277\200\375\376\373\374\272\256\131\104\105\102\106\103\107\234\110\124\121-\123\130\125-\127\214\111\315\316\313\317\314\341\160\335\336\333\334\215\216\337]/; } if (ord('^')==106) { # posix-bc return $char =~ /[\101\252\260\261\237\262\320\265\171\264\232\212\272\312\257\241\220\217\352\372\276\240\266\263\235\332\233\213\267\270\271\253\144\145\142\146\143\147\236\150\164\161-\163\170\165-\167\254\151\355\356\353\357\354\277\200\340\376\335\374\255\256\131\104\105\102\106\103\107\234\110\124\121-\123\130\125-\127\214\111\315\316\313\317\314\341\160\300\336\333\334\215\216\337]/; } } Note however that only the C sub is really independent of coded character set. Another way to write C would be to use the characters in the range explicitly: sub Is_latin_1 { my $char = substr(shift,0,1); $char =~ /[ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ]/; } Although that form may run into trouble in network transit (due to the presence of 8 bit characters) or on non ISO-Latin character sets. =head1 SOCKETS Most socket programming assumes ASCII character encodings in network byte order. Exceptions can include CGI script writing under a host web server where the server may take care of translation for you. Most host web servers convert EBCDIC data to ISO-8859-1 or Unicode on output. =head1 SORTING One big difference between ASCII-based character sets and EBCDIC ones are the relative positions of upper and lower case letters and the letters compared to the digits. If sorted on an ASCII-based platform the two-letter abbreviation for a physician comes before the two letter abbreviation for drive; that is: @sorted = sort(qw(Dr. 
dr.)); # @sorted holds ('Dr.','dr.') on ASCII, # but ('dr.','Dr.') on EBCDIC

The property of lowercase before uppercase letters in EBCDIC is
even carried to the Latin 1 EBCDIC pages such as 0037 and 1047.
An example would be that E<Euml> C<E WITH DIAERESIS> (203) comes
before E<euml> C<e WITH DIAERESIS> (235) on an ASCII platform, but
the latter (83) comes before the former (115) on an EBCDIC platform.
(Astute readers will note that the uppercase version of E<szlig>
C<SMALL LETTER SHARP S> is simply "SS" and that the upper case version of
E<yuml> C<y WITH DIAERESIS> is not in the 0..255 range but it is
at U+0178 in Unicode, or C<"\x{178}"> in a Unicode enabled Perl.)

The sort order will cause differences between results obtained on
ASCII platforms versus EBCDIC platforms.  What follows are some suggestions
on how to deal with these differences.

=head2 Ignore ASCII vs. EBCDIC sort differences.

This is the least computationally expensive strategy.  It may require
some user education.

=head2 MONO CASE then sort data.

In order to minimize the expense of mono casing mixed-case text, try to
C<tr///> towards the character set case most employed within the data.
If the data are primarily UPPERCASE non Latin 1 then apply tr/a-z/A-Z/
then sort().  If the data are primarily lowercase non Latin 1 then
apply tr/A-Z/a-z/ before sorting.  If the data are primarily UPPERCASE
and include Latin-1 characters then apply:

    tr/a-z/A-Z/;
    tr/àáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ/ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ/;
    s/ß/SS/g;

then sort().  Do note however that such Latin-1 manipulation does not
address the E<yuml> C<y WITH DIAERESIS> character that will remain at
code point 255 on ASCII platforms, but 223 on most EBCDIC platforms
where it will sort to a place less than the EBCDIC numerals.  With a
Unicode-enabled Perl you might try:

    tr/ÿ/\x{178}/;

The strategy of mono casing data before sorting does not preserve the case
of the data and may not be acceptable for that reason.

=head2 Convert, sort data, then re convert.

This is the most expensive proposition that does not employ a network
connection.
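One way to sketch the convert, sort, re-convert strategy is to sort on a
converted copy of each string while carrying the original along, so that
nothing needs converting back afterwards.  This is an illustration under
stated assumptions, not a definitive implementation: the helper names
C<latin1_key> and C<sort_latin1_order> are hypothetical, C<@e2a> is
assumed to map native code points to Latin-1 code points, and every
character is assumed to lie in the 0..255 range.

```perl
use strict;
use warnings;

# Assumption: identity map, which is only correct on an ASCII platform.
# On EBCDIC, replace it with a full 256-entry EBCDIC to Latin-1 table.
my @e2a = (0 .. 255);

# Hypothetical helper: build a sort key holding the Latin-1 code points.
sub latin1_key {
    my ($string) = @_;
    return join '', map { chr $e2a[ord $_] } split //, $string;
}

# Hypothetical helper: return the strings in ASCII/Latin-1 order.
# Each original string travels with its key, so no re-conversion is needed.
sub sort_latin1_order {
    my @strings = @_;
    return map  { $_->[1] }
           sort { $a->[0] cmp $b->[0] }
           map  { [ latin1_key($_), $_ ] } @strings;
}

my @sorted = sort_latin1_order(qw(dr. Dr.));
print "@sorted\n";
```

The key is computed once per string rather than once per comparison, which
keeps the cost of the conversion linear in the amount of data.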
=head2 Perform sorting on one type of platform only. This strategy can employ a network connection. As such it would be computationally expensive. =head1 TRANSFORMATION FORMATS There are a variety of ways of transforming data with an intra character set mapping that serve a variety of purposes. Sorting was discussed in the previous section and a few of the other more popular mapping techniques are discussed next. =head2 URL decoding and encoding Note that some URLs have hexadecimal ASCII code points in them in an attempt to overcome character or protocol limitation issues. For example the tilde character is not on every keyboard hence a URL of the form: http://www.pvhp.com/~pvhp/ may also be expressed as either of: http://www.pvhp.com/%7Epvhp/ http://www.pvhp.com/%7epvhp/ where 7E is the hexadecimal ASCII code point for '~'. Here is an example of decoding such a URL under CCSID 1047: $url = 'http://www.pvhp.com/%7Epvhp/'; # this array assumes code page 1047 my @a2e_1047 = ( 0, 1, 2, 3, 55, 45, 46, 47, 22, 5, 21, 11, 12, 13, 14, 15, 16, 17, 18, 19, 60, 61, 50, 38, 24, 25, 63, 39, 28, 29, 30, 31, 64, 90,127,123, 91,108, 80,125, 77, 93, 92, 78,107, 96, 75, 97, 240,241,242,243,244,245,246,247,248,249,122, 94, 76,126,110,111, 124,193,194,195,196,197,198,199,200,201,209,210,211,212,213,214, 215,216,217,226,227,228,229,230,231,232,233,173,224,189, 95,109, 121,129,130,131,132,133,134,135,136,137,145,146,147,148,149,150, 151,152,153,162,163,164,165,166,167,168,169,192, 79,208,161, 7, 32, 33, 34, 35, 36, 37, 6, 23, 40, 41, 42, 43, 44, 9, 10, 27, 48, 49, 26, 51, 52, 53, 54, 8, 56, 57, 58, 59, 4, 20, 62,255, 65,170, 74,177,159,178,106,181,187,180,154,138,176,202,175,188, 144,143,234,250,190,160,182,179,157,218,155,139,183,184,185,171, 100,101, 98,102, 99,103,158,104,116,113,114,115,120,117,118,119, 172,105,237,238,235,239,236,191,128,253,254,251,252,186,174, 89, 68, 69, 66, 70, 67, 71,156, 72, 84, 81, 82, 83, 88, 85, 86, 87, 140, 
73,205,206,203,207,204,225,112,221,222,219,220,141,142,223 ); $url =~ s/%([0-9a-fA-F]{2})/pack("c",$a2e_1047[hex($1)])/ge; Conversely, here is a partial solution for the task of encoding such a URL under the 1047 code page: $url = 'http://www.pvhp.com/~pvhp/'; # this array assumes code page 1047 my @e2a_1047 = ( 0, 1, 2, 3,156, 9,134,127,151,141,142, 11, 12, 13, 14, 15, 16, 17, 18, 19,157, 10, 8,135, 24, 25,146,143, 28, 29, 30, 31, 128,129,130,131,132,133, 23, 27,136,137,138,139,140, 5, 6, 7, 144,145, 22,147,148,149,150, 4,152,153,154,155, 20, 21,158, 26, 32,160,226,228,224,225,227,229,231,241,162, 46, 60, 40, 43,124, 38,233,234,235,232,237,238,239,236,223, 33, 36, 42, 41, 59, 94, 45, 47,194,196,192,193,195,197,199,209,166, 44, 37, 95, 62, 63, 248,201,202,203,200,205,206,207,204, 96, 58, 35, 64, 39, 61, 34, 216, 97, 98, 99,100,101,102,103,104,105,171,187,240,253,254,177, 176,106,107,108,109,110,111,112,113,114,170,186,230,184,198,164, 181,126,115,116,117,118,119,120,121,122,161,191,208, 91,222,174, 172,163,165,183,169,167,182,188,189,190,221,168,175, 93,180,215, 123, 65, 66, 67, 68, 69, 70, 71, 72, 73,173,244,246,242,243,245, 125, 74, 75, 76, 77, 78, 79, 80, 81, 82,185,251,252,249,250,255, 92,247, 83, 84, 85, 86, 87, 88, 89, 90,178,212,214,210,211,213, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57,179,219,220,217,218,159 ); # The following regular expression does not address the # mappings for: ('.' => '%2E', '/' => '%2F', ':' => '%3A') $url =~ s/([\t "#%&\(\),;<=>\?\@\[\\\]^`{|}~])/sprintf("%%%02X",$e2a_1047[ord($1)])/ge; where a more complete solution would split the URL into components and apply a full s/// substitution only to the appropriate parts. In the remaining examples a @e2a or @a2e array may be employed but the assignment will not be shown explicitly. For code page 1047 you could use the @a2e_1047 or @e2a_1047 arrays just shown. 
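The two arrays above are inverses of each other, so one can be derived from
the other instead of being typed in by hand.  The sketch below uses a
hypothetical helper, C<invert_table>, and a four-entry toy table for
illustration; it assumes the input table is a permutation (each value
appears exactly once), which holds for the 1047 table shown above.

```perl
use strict;
use warnings;

# Hypothetical helper: invert a permutation table so that
# $inverse[ $forward[$i] ] == $i for every index $i.
sub invert_table {
    my @forward = @_;
    my @inverse;
    $inverse[ $forward[$_] ] = $_ for 0 .. $#forward;
    return @inverse;
}

# Toy table standing in for a full 256-entry @a2e array.
my @toy_forward = (2, 0, 3, 1);
my @toy_inverse = invert_table(@toy_forward);
print "@toy_inverse\n";    # prints "1 3 0 2"
```

With the full table this would read C<my @e2a_1047 = invert_table(@a2e_1047);>.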
=head2 uu encoding and decoding The C template to pack() or unpack() will render EBCDIC data in EBCDIC characters equivalent to their ASCII counterparts. For example, the following will print "Yes indeed\n" on either an ASCII or EBCDIC computer: $all_byte_chrs = ''; for (0..255) { $all_byte_chrs .= chr($_); } $uuencode_byte_chrs = pack('u', $all_byte_chrs); ($uu = <<'ENDOFHEREDOC') =~ s/^\s*//gm; M``$"`P0%!@<("0H+#`T.#Q`1$A,4%187&!D:&QP='A\@(2(C)"4F)R@I*BLL M+2XO,#$R,S0U-C'EZ>WQ]?G^`@8*#A(6& MAXB)BHN,C8Z/D)&2DY25EI>8F9J;G)V>GZ"AHJ.DI::GJ*FJJZRMKJ^PL;*S MM+6VM[BYNKN\O;Z_P,'"P\3%QL?(R+CY.7FY^CIZNOL[>[O\/'R\_3U]O?X^?K[_/W^_P`` ENDOFHEREDOC if ($uuencode_byte_chrs eq $uu) { print "Yes "; } $uudecode_byte_chrs = unpack('u', $uuencode_byte_chrs); if ($uudecode_byte_chrs eq $all_byte_chrs) { print "indeed\n"; } Here is a very spartan uudecoder that will work on EBCDIC provided that the @e2a array is filled in appropriately: #!/usr/local/bin/perl @e2a = ( # this must be filled in ); $_ = <> until ($mode,$file) = /^begin\s*(\d*)\s*(\S*)/; open(OUT, "> $file") if $file ne ""; while(<>) { last if /^end/; next if /[a-z]/; next unless int(((($e2a[ord()] - 32 ) & 077) + 2) / 3) == int(length() / 4); print OUT unpack("u", $_); } close(OUT); chmod oct($mode), $file; =head2 Quoted-Printable encoding and decoding On ASCII-encoded platforms it is possible to strip characters outside of the printable set using: # This QP encoder works on ASCII only $qp_string =~ s/([=\x00-\x1F\x80-\xFF])/sprintf("=%02X",ord($1))/ge; Whereas a QP encoder that works on both ASCII and EBCDIC platforms would look somewhat like the following (where the EBCDIC branch @e2a array is omitted for brevity): if (ord('A') == 65) { # ASCII $delete = "\x7F"; # ASCII @e2a = (0 .. 
255) # ASCII to ASCII identity map } else { # EBCDIC $delete = "\x07"; # EBCDIC @e2a = # EBCDIC to ASCII map (as shown above) } $qp_string =~ s/([^ !"\#\$%&'()*+,\-.\/0-9:;<>?\@A-Z[\\\]^_`a-z{|}~$delete])/sprintf("=%02X",$e2a[ord($1)])/ge; (although in production code the substitutions might be done in the EBCDIC branch with the @e2a array and separately in the ASCII branch without the expense of the identity map). Such QP strings can be decoded with: # This QP decoder is limited to ASCII only $string =~ s/=([0-9A-Fa-f][0-9A-Fa-f])/chr hex $1/ge; $string =~ s/=[\n\r]+$//; Whereas a QP decoder that works on both ASCII and EBCDIC platforms would look somewhat like the following (where the @a2e array is omitted for brevity): $string =~ s/=([0-9A-Fa-f][0-9A-Fa-f])/chr $a2e[hex $1]/ge; $string =~ s/=[\n\r]+$//; =head2 Caesarean ciphers The practice of shifting an alphabet one or more characters for encipherment dates back thousands of years and was explicitly detailed by Gaius Julius Caesar in his B text. A single alphabet shift is sometimes referred to as a rotation and the shift amount is given as a number $n after the string 'rot' or "rot$n". Rot0 and rot26 would designate identity maps on the 26-letter English version of the Latin alphabet. Rot13 has the interesting property that alternate subsequent invocations are identity maps (thus rot13 is its own non-trivial inverse in the group of 26 alphabet rotations). Hence the following is a rot13 encoder and decoder that will work on ASCII and EBCDIC platforms: #!/usr/local/bin/perl while(<>){ tr/n-za-mN-ZA-M/a-zA-Z/; print; } In one-liner form: perl -ne 'tr/n-za-mN-ZA-M/a-zA-Z/;print' =head1 Hashing order and checksums To the extent that it is possible to write code that depends on hashing order there may be differences between hashes as stored on an ASCII-based platform and hashes stored on an EBCDIC-based platform. 
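One way to avoid depending on hashing order is to impose an explicit order whenever the order of keys matters. The sketch below is only an illustration with made-up data (%port is hypothetical). It sorts on numeric values, which sidesteps both the hashing order and the character collation differences between ASCII and EBCDIC platforms.

```perl
use strict;
use warnings;

# Hypothetical data: service names mapped to port numbers.
my %port = (http => 80, ftp => 21, smtp => 25);

# keys() returns the names in an unspecified order; sorting on the
# numeric values gives the same result on every platform and every run.
my @by_port = sort { $port{$a} <=> $port{$b} } keys %port;

print join(",", @by_port), "\n";    # ftp,smtp,http
```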
XXX

=head1 I18N AND L10N

Internationalization (I18N) and localization (L10N) are supported at least
in principle even on EBCDIC platforms. The details are system-dependent
and discussed under the L</OS ISSUES> section below.

=head1 MULTI-OCTET CHARACTER SETS

Perl may work with an internal UTF-EBCDIC encoding form for wide characters
on EBCDIC platforms in a manner analogous to the way that it works with
the UTF-8 internal encoding form on ASCII based platforms.

Legacy multi byte EBCDIC code pages XXX.

=head1 OS ISSUES

There may be a few system-dependent issues
of concern to EBCDIC Perl programmers.

=head2 OS/400

=over 8

=item PASE

The PASE environment is a runtime environment for OS/400 that can run
executables built for PowerPC AIX in OS/400; see L<perlos400>. PASE
is ASCII-based, not EBCDIC-based as the ILE.

=item IFS access

XXX.

=back

=head2 OS/390, z/OS

Perl runs under Unix Systems Services or USS.

=over 8

=item chcp

B<chcp> is supported as a shell utility for displaying and changing
one's code page. See also L.

=item dataset access

For sequential data set access try:

    my @ds_records = `cat //DSNAME`;

or:

    my @ds_records = `cat //'HLQ.DSNAME'`;

See also the OS390::Stdio module on CPAN.

=item OS/390, z/OS iconv

B<iconv> is supported as both a shell utility and a C RTL routine.
See also the iconv(1) and iconv(3) manual pages.

=item locales

On OS/390 or z/OS see L for information on locales. The L10N files
are in F. $Config{d_setlocale} is 'define' on OS/390 or z/OS.

=back

=head2 VM/ESA?

XXX.

=head2 POSIX-BC?

XXX.

=head1 BUGS

This pod document contains literal Latin 1 characters and may encounter
translation difficulties. In particular one popular nroff implementation
was known to strip accented characters to their unaccented counterparts
while attempting to view this document through the B program
(for example, you may see a plain C rather than one with a diaeresis
as in E). Another nroff truncated the resultant manpage at
the first occurrence of 8 bit characters.
Not all shells will allow multiple C<-e> string arguments to perl to
be concatenated together properly as recipes 0, 2, 4, 5, and 6 might
seem to imply.

=head1 SEE ALSO

L, L, L, L.

=head1 REFERENCES

L

L

L

L

B Tom Jennings, September 1999.

B The Unicode Consortium, Lisa Moore ed.,
ISBN 0-201-61633-5, Addison Wesley Developers Press, February 2000.

B, IBM SC09-2190-00, December 1996.

"Demystifying Character Sets", Andrea Vine, Multilingual Computing
& Technology, B<#26 Vol. 10 Issue 4>, August/September 1999;
ISSN 1523-0309; Multilingual Computing Inc. Sandpoint ID, USA.

B Fred B. Wrixon, ISBN 1-57912-040-7, Black Dog & Leventhal
Publishers, 1998.

L

B Robert Bemer.

=head1 HISTORY

15 April 2001: added UTF-8 and UTF-EBCDIC to main table, pvhp.

=head1 AUTHOR

Peter Prymmer pvhp@best.com wrote this in 1999 and 2000
with CCSID 0819 and 0037 help from Chris Leach and
André Pirard A.Pirard@ulg.ac.be as well as POSIX-BC
help from Thomas Dorner Thomas.Dorner@start.de.
Thanks also to Vickie Cooper, Philip Newton, William Raffloer, and
Joe Smith. Trademarks, registered trademarks, service marks and
registered service marks used in this document are the property of
their respective owners.

=head1 NAME

perlsec - Perl security

=head1 DESCRIPTION

Perl is designed to make it easy to program securely even when running
with extra privileges, like setuid or setgid programs. Unlike most
command line shells, which are based on multiple substitution passes on
each line of the script, Perl uses a more conventional evaluation scheme
with fewer hidden snags. Additionally, because the language has more
builtin functionality, it can rely less upon external (and possibly
untrustworthy) programs to accomplish its purposes.

=head1 SECURITY VULNERABILITY CONTACT INFORMATION

If you believe you have found a security vulnerability in Perl, please email
perl5-security-report@perl.org with details.
This points to a closed
subscription, unarchived mailing list. Please only use this address for
security issues in the Perl core, not for modules independently distributed on
CPAN.

=head1 SECURITY MECHANISMS AND CONCERNS

=head2 Taint mode

Perl automatically enables a set of special security checks, called I<taint
mode>, when it detects its program running with differing real and effective
user or group IDs. The setuid bit in Unix permissions is mode 04000, the
setgid bit mode 02000; either or both may be set. You can also enable taint
mode explicitly by using the B<-T> command line flag. This flag is
I<strongly> suggested for server programs and any program run on behalf of
someone else, such as a CGI script. Once taint mode is on, it's on for
the remainder of your script.

While in this mode, Perl takes special precautions called I<taint
checks> to prevent both obvious and subtle traps. Some of these checks
are reasonably simple, such as verifying that path directories aren't
writable by others; careful programmers have always used checks like
these. Other checks, however, are best supported by the language itself,
and it is these checks especially that contribute to making a set-id Perl
program more secure than the corresponding C program.

You may not use data derived from outside your program to affect
something else outside your program--at least, not by accident. All
command line arguments, environment variables, locale information (see
L<perllocale>), results of certain system calls (C<readdir()>,
C<readlink()>, the variable of C<shmread()>, the messages returned by
C<msgrcv()>, the password, gcos and shell fields returned by the
C<getpwxxx()> calls), and all file input are marked as "tainted".
Tainted data may not be used directly or indirectly in any command
that invokes a sub-shell, nor in any command that modifies files,
directories, or processes, B<with the following exceptions>:

=over 4

=item *

Arguments to C<print> and C<syswrite> are B<not> checked for taintedness.

=item *

Symbolic methods

    $obj->$method(@args);

and symbolic sub references

    &{$foo}(@args);
    $foo->(@args);

are not checked for taintedness.
This requires extra carefulness
unless you want external data to affect your control flow. Unless
you carefully limit what these symbolic values are, people are able
to call functions B<outside> your Perl code, such as POSIX::system,
in which case they are able to run arbitrary external code.

=item *

Hash keys are B<never> tainted.

=back

For efficiency reasons, Perl takes a conservative view of
whether data is tainted. If an expression contains tainted data,
any subexpression may be considered tainted, even if the value
of the subexpression is not itself affected by the tainted data.

Because taintedness is associated with each scalar value, some
elements of an array or hash can be tainted and others not.
The keys of a hash are B<never> tainted.

For example:

    $arg = shift;               # $arg is tainted
    $hid = $arg, 'bar';         # $hid is also tainted
    $line = <>;                 # Tainted
    $line = <STDIN>;            # Also tainted
    open FOO, "/home/me/bar" or die $!;
    $line = <FOO>;              # Still tainted
    $path = $ENV{'PATH'};       # Tainted, but see below
    $data = 'abc';              # Not tainted

    system "echo $arg";         # Insecure
    system "/bin/echo", $arg;   # Considered insecure
                                # (Perl doesn't know about /bin/echo)
    system "echo $hid";         # Insecure
    system "echo $data";        # Insecure until PATH set

    $path = $ENV{'PATH'};       # $path now tainted

    $ENV{'PATH'} = '/bin:/usr/bin';
    delete @ENV{'IFS', 'CDPATH', 'ENV', 'BASH_ENV'};

    $path = $ENV{'PATH'};       # $path now NOT tainted
    system "echo $data";        # Is secure now!

    open(FOO, "< $arg");        # OK - read-only file
    open(FOO, "> $arg");        # Not OK - trying to write

    open(FOO,"echo $arg|");     # Not OK
    open(FOO,"-|")
        or exec 'echo', $arg;   # Also not OK

    $shout = `echo $arg`;       # Insecure, $shout now tainted

    unlink $data, $arg;         # Insecure
    umask $arg;                 # Insecure

    exec "echo $arg";           # Insecure
    exec "echo", $arg;          # Insecure
    exec "sh", '-c', $arg;      # Very insecure!

    @files = <*.c>;             # insecure (uses readdir() or similar)
    @files = glob('*.c');       # insecure (uses readdir() or similar)

    # In Perl releases older than 5.6.0 the <*.c> and glob('*.c') would
    # have used an external program to do the filename expansion; but in
    # either case the result is tainted since the list of filenames comes
    # from outside of the program.

    $bad = ($arg, 23);          # $bad will be tainted
    $arg, `true`;               # Insecure (although it isn't really)

If you try to do something insecure, you will get a fatal error saying
something like "Insecure dependency" or "Insecure $ENV{PATH}".

The exception to the principle of "one tainted value taints the whole
expression" is with the ternary conditional operator C<?:>. Since code
with a ternary conditional

    $result = $tainted_value ? "Untainted" : "Also untainted";

is effectively

    if ( $tainted_value ) {
        $result = "Untainted";
    } else {
        $result = "Also untainted";
    }

it doesn't make sense for C<$result> to be tainted.

=head2 Laundering and Detecting Tainted Data

To test whether a variable contains tainted data, and whose use would
thus trigger an "Insecure dependency" message, you can use the
C<tainted()> function of the Scalar::Util module, available in your
nearby CPAN mirror, and included in Perl starting from the release 5.8.0.
Or you may be able to use the following C<is_tainted()> function.

    sub is_tainted {
        local $@;   # Don't pollute caller's value.
        return ! eval { eval("#" . substr(join("", @_), 0, 0)); 1 };
    }

This function makes use of the fact that the presence of tainted data
anywhere within an expression renders the entire expression tainted. It
would be inefficient for every operator to test every argument for
taintedness. Instead, the slightly more efficient and conservative
approach is used that if any tainted value has been accessed within the
same expression, the whole expression is considered tainted.

But testing for taintedness gets you only so far. Sometimes you have just
to clear your data's taintedness.
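As a rough illustration of the detection side, the following sketch uses tainted() from Scalar::Util to report on a few values. The helper name taint_report() is hypothetical. When the program is not run under B<-T>, every value is reported as clean.

```perl
use strict;
use warnings;
use Scalar::Util qw(tainted);

# Hypothetical helper: takes name => value pairs and returns one line per
# name saying whether the value is currently marked as tainted.
sub taint_report {
    my (%values) = @_;
    return map {
        sprintf "%s: %s", $_, tainted($values{$_}) ? "tainted" : "clean"
    } sort keys %values;
}

print "$_\n" for taint_report(
    literal => 'abc',         # a literal is never tainted
    path    => $ENV{PATH},    # tainted when running under -T
);
```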
Values may be untainted by using them
as keys in a hash; otherwise the only way to bypass the tainting
mechanism is by referencing subpatterns from a regular expression match.
Perl presumes that if you reference a substring using $1, $2, etc., that
you knew what you were doing when you wrote the pattern. That means using
a bit of thought--don't just blindly untaint anything, or you defeat the
entire mechanism. It's better to verify that the variable has only good
characters (for certain values of "good") rather than checking whether it
has any bad characters. That's because it's far too easy to miss bad
characters that you never thought of.

Here's a test to make sure that the data contains nothing but "word"
characters (alphabetics, numerics, and underscores), a hyphen, an at sign,
or a dot.

    if ($data =~ /^([-\@\w.]+)$/) {
        $data = $1;                     # $data now untainted
    } else {
        die "Bad data in '$data'";      # log this somewhere
    }

This is fairly secure because C</\w+/> doesn't normally match shell
metacharacters, nor are dot, dash, or at going to mean something special
to the shell. Use of C</.+/> would have been insecure in theory because
it lets everything through, but Perl doesn't check for that. The lesson
is that when untainting, you must be exceedingly careful with your patterns.
Laundering data using regular expression is the I<only> mechanism for
untainting dirty data, unless you use the strategy detailed below to fork
a child of lesser privilege.

The example does not untaint C<$data> if C<use locale> is in effect,
because the characters matched by C<\w> are determined by the locale.
Perl considers that locale definitions are untrustworthy because they
contain data from outside the program. If you are writing a
locale-aware program, and want to launder data with a regular expression
containing C<\w>, put C<no locale> ahead of the expression in the same
block. See L<perllocale/SECURITY> for further discussion and examples.

=head2 Switches On the "#!" Line

When you make a script executable, in order to make it usable as a
command, the system will pass switches to perl from the script's #!
line. Perl checks that any command line switches given to a setuid
(or setgid) script actually match the ones set on the #! line. Some
Unix and Unix-like environments impose a one-switch limit on the #!
line, so you may need to use something like C<-wU> instead of C<-w -U>
under such systems. (This issue should arise only in Unix or
Unix-like environments that support #! and setuid or setgid scripts.)

=head2 Taint mode and @INC

When the taint mode (C<-T>) is in effect, the "." directory is removed
from C<@INC>, and the environment variables C<PERL5LIB> and C<PERLLIB>
are ignored by Perl. You can still adjust C<@INC> from outside the
program by using the C<-I> command line option as explained in
L<perlrun>. The two environment variables are ignored because
they are obscured, and a user running a program could be unaware that
they are set, whereas the C<-I> option is clearly visible and
therefore permitted.

Another way to modify C<@INC> without modifying the program, is to use
the C<lib> pragma, e.g.:

    perl -Mlib=/foo program

The benefit of using C<-Mlib=/foo> over C<-I/foo>, is that the former
will automagically remove any duplicated directories, while the latter
will not.

Note that if a tainted string is added to C<@INC>, the following
problem will be reported:

    Insecure dependency in require while running with -T switch

=head2 Cleaning Up Your Path

For "Insecure C<$ENV{PATH}>" messages, you need to set C<$ENV{'PATH'}> to
a known value, and each directory in the path must be absolute and
non-writable by others than its owner and group. You may be surprised to
get this message even if the pathname to your executable is fully
qualified. This is I<not> generated because you didn't supply a full path
to the program; instead, it's generated because you never set your PATH
environment variable, or you didn't set it to something that was safe.
Because Perl can't guarantee that the executable in question isn't itself
going to turn around and execute some other program that is dependent on
your PATH, it makes sure you set the PATH.

The PATH isn't the only environment variable which can cause problems.
Because some shells may use the variables IFS, CDPATH, ENV, and
BASH_ENV, Perl checks that those are either empty or untainted when
starting subprocesses. You may wish to add something like this to your
setid and taint-checking scripts.

    delete @ENV{qw(IFS CDPATH ENV BASH_ENV)};   # Make %ENV safer

It's also possible to get into trouble with other operations that don't
care whether they use tainted values. Make judicious use of the file
tests in dealing with any user-supplied filenames. When possible, do
opens and such B<after> properly dropping any special user (or group!)
privileges. Perl doesn't prevent you from opening tainted filenames for
reading, so be careful what you print out. The tainting mechanism is
intended to prevent stupid mistakes, not to remove the need for thought.

Perl does not call the shell to expand wild cards when you pass C<system>
and C<exec> explicit parameter lists instead of strings with possible shell
wildcards in them. Unfortunately, the C<open>, C<glob>, and
backtick functions provide no such alternate calling convention, so more
subterfuge will be required.

Perl provides a reasonably safe way to open a file or pipe from a setuid
or setgid program: just create a child process with reduced privilege who
does the dirty work for you. First, fork a child using the special
C<open> syntax that connects the parent and child by a pipe. Now the
child resets its ID set and any other per-process attributes, like
environment variables, umasks, current working directories, back to the
originals or known safe values. Then the child process, which no longer
has any special permissions, does the C<open> or other system call.
Finally, the child passes the data it managed to access back to the
parent.
Because the file or pipe was opened in the child while running
under less privilege than the parent, it's not apt to be tricked into
doing something it shouldn't.

Here's a way to do backticks reasonably safely. Notice how the C<exec> is
not called with a string that the shell could expand. This is by far the
best way to call something that might be subjected to shell escapes: just
never call the shell at all.

    use English '-no_match_vars';
    die "Can't fork: $!" unless defined($pid = open(KID, "-|"));
    if ($pid) {           # parent
        while (<KID>) {
            # do something
        }
        close KID;
    } else {
        my @temp     = ($EUID, $EGID);
        my $orig_uid = $UID;
        my $orig_gid = $GID;
        $EUID = $UID;
        $EGID = $GID;
        # Drop privileges
        $UID  = $orig_uid;
        $GID  = $orig_gid;
        # Make sure privs are really gone
        ($EUID, $EGID) = @temp;
        die "Can't drop privileges"
            unless $UID == $EUID && $GID eq $EGID;
        $ENV{PATH} = "/bin:/usr/bin"; # Minimal PATH.
        # Consider sanitizing the environment even more.
        exec 'myprog', 'arg1', 'arg2'
            or die "can't exec myprog: $!";
    }

A similar strategy would work for wildcard expansion via C<glob>, although
you can use C<readdir> instead.

Taint checking is most useful when although you trust yourself not to have
written a program to give away the farm, you don't necessarily trust those
who end up using it not to try to trick it into doing something bad. This
is the kind of security checking that's useful for set-id programs and
programs launched on someone else's behalf, like CGI programs.

This is quite different, however, from not even trusting the writer of the
code not to try to do something evil. That's the kind of trust needed
when someone hands you a program you've never seen before and says, "Here,
run this." For that kind of safety, you might want to check out the Safe
module, included standard in the Perl distribution. This module allows the
programmer to set up special compartments in which all system operations
are trapped and namespace access is carefully controlled.
Safe should
not be considered bullet-proof, though: it will not prevent the foreign
code from setting up infinite loops, allocating gigabytes of memory, or
even abusing perl bugs to make the host interpreter crash or behave in
unpredictable ways. In any case it's better avoided completely if you're
really concerned about security.

=head2 Security Bugs

Beyond the obvious problems that stem from giving special privileges to
systems as flexible as scripts, on many versions of Unix, set-id scripts
are inherently insecure right from the start. The problem is a race
condition in the kernel. Between the time the kernel opens the file to
see which interpreter to run and when the (now-set-id) interpreter turns
around and reopens the file to interpret it, the file in question may have
changed, especially if you have symbolic links on your system.

Fortunately, sometimes this kernel "feature" can be disabled.
Unfortunately, there are two ways to disable it. The system can simply
outlaw scripts with any set-id bit set, which doesn't help much.
Alternately, it can simply ignore the set-id bits on scripts.

However, if the kernel set-id script feature isn't disabled, Perl will
complain loudly that your set-id script is insecure. You'll need to
either disable the kernel set-id script feature, or put a C wrapper around
the script. A C wrapper is just a compiled program that does nothing
except call your Perl program. Compiled programs are not subject to the
kernel bug that plagues set-id scripts. Here's a simple wrapper, written
in C:

    #include <unistd.h>

    #define REAL_PATH "/path/to/script"

    int main(int ac, char **av)
    {
        execv(REAL_PATH, av);
        return 1;    /* reached only if execv() fails */
    }

Compile this wrapper into a binary executable and then make I<it> rather
than your script setuid or setgid.

In recent years, vendors have begun to supply systems free of this
inherent security bug. On such systems, when the kernel passes the name
of the set-id script to open to the interpreter, rather than using a
pathname subject to meddling, it instead passes I</dev/fd/3>.
This is a
special file already opened on the script, so that there can be no race
condition for evil scripts to exploit. On these systems, Perl should be
compiled with C<-DSETUID_SCRIPTS_ARE_SECURE_NOW>. The F<Configure>
program that builds Perl tries to figure this out for itself, so you
should never have to specify this yourself. Most modern releases of
SysVr4 and BSD 4.4 use this approach to avoid the kernel race condition.

=head2 Protecting Your Programs

There are a number of ways to hide the source to your Perl programs,
with varying levels of "security".

First of all, however, you I<can't> take away read permission, because
the source code has to be readable in order to be compiled and
interpreted. (That doesn't mean that a CGI script's source is
readable by people on the web, though.) So you have to leave the
permissions at the socially friendly 0755 level. This lets
people on your local system only see your source.

Some people mistakenly regard this as a security problem. If your program
does insecure things, and relies on people not knowing how to exploit those
insecurities, it is not secure. It is often possible for someone to
determine the insecure things and exploit them without viewing the
source. Security through obscurity, the name for hiding your bugs
instead of fixing them, is little security indeed.

You can try using encryption via source filters (Filter::* from CPAN,
or Filter::Util::Call and Filter::Simple since Perl 5.8).
But crackers might be able to decrypt it. You can try using the byte
code compiler and interpreter described below, but crackers might be
able to de-compile it. You can try using the native-code compiler
described below, but crackers might be able to disassemble it. These
pose varying degrees of difficulty to people wanting to get at your
code, but none can definitively conceal it (this is true of every
language, not just Perl).
If you're concerned about people profiting from your code, then the
bottom line is that nothing but a restrictive license will give you
legal security. License your software and pepper it with threatening
statements like "This is unpublished proprietary software of XYZ Corp.
Your access to it does not give you permission to use it blah blah
blah." You should see a lawyer to be sure your license's wording will
stand up in court.

=head2 Unicode

Unicode is a new and complex technology and one may easily overlook
certain security pitfalls. See L<perluniintro> for an overview and
L<perlunicode> for details, and L for security implications in particular.

=head2 Algorithmic Complexity Attacks

Certain internal algorithms used in the implementation of Perl can
be attacked by choosing the input carefully to consume large amounts
of either time or space or both. This can lead to so-called
I<Denial of Service> (DoS) attacks.

=over 4

=item *

Hash Function - the algorithm used to "order" hash elements has been
changed several times during the development of Perl, mainly to be
reasonably fast. In Perl 5.8.1 the security aspect was also taken
into account.

In Perls before 5.8.1 one could rather easily generate data that as
hash keys would cause Perl to consume large amounts of time because
the internal structure of hashes would badly degenerate. In Perl 5.8.1
the hash function is randomly perturbed by a pseudorandom seed which
makes generating such naughty hash keys harder.
See L<perlrun/PERL_HASH_SEED> for more information.

In Perl 5.8.1 the random perturbation was done by default, but as of
5.8.2 it is only used on individual hashes if the internals detect the
insertion of pathological data. If one wants for some reason to emulate
the old behaviour (and expose oneself to DoS attacks) one can set the
environment variable PERL_HASH_SEED to zero to disable the protection
(or any other integer to force a known perturbation, rather than random).
One possible reason for wanting to emulate the old behaviour is that in the
new behaviour consecutive runs of Perl will order hash keys differently,
which may confuse some applications (like Data::Dumper: the outputs of two
different runs are no longer identical).

B<Perl has never guaranteed any ordering of the hash keys>, and the
ordering has already changed several times during the lifetime of
Perl 5. Also, the ordering of hash keys has always been, and
continues to be, affected by the insertion order.

Also note that while the order of the hash elements might be
randomised, this "pseudoordering" should B<not> be used for
applications like shuffling a list randomly (use List::Util::shuffle()
for that, see L<List::Util>, a standard core module since Perl 5.8.0;
or the CPAN module Algorithm::Numerical::Shuffle), or for generating
permutations (use e.g. the CPAN modules Algorithm::Permute or
Algorithm::FastPermute), or for any cryptographic applications.

=item *

Regular expressions - Perl's regular expression engine is a so-called NFA
(Non-deterministic Finite Automaton), which among other things means that
it can rather easily consume large amounts of both time and space if the
regular expression may match in several ways. Careful crafting of the
regular expressions can help but quite often there really isn't much
one can do (the book "Mastering Regular Expressions" is required
reading, see L<perlfaq2>). Running out of space manifests itself by
Perl running out of memory.

=item *

Sorting - the quicksort algorithm used in Perls before 5.8.0 to
implement the sort() function is very easy to trick into misbehaving
so that it consumes a lot of time. Starting from Perl 5.8.0 a different
sorting algorithm, mergesort, is used by default. Mergesort cannot
misbehave on any input.

=back

See L for more information, and any computer science textbook on
algorithmic complexity.

=head1 SEE ALSO

L<perlrun> for its description of cleaning up environment variables.
=encoding utf8

=for comment
Consistent formatting of this file is achieved with:
  perl ./Porting/podtidy pod/perlhacktips.pod

=head1 NAME

perlhacktips - Tips for Perl core C code hacking

=head1 DESCRIPTION

This document will help you learn the best way to go about hacking on
the Perl core C code. It covers common problems, debugging, profiling,
and more.

If you haven't read L<perlhack> and L<perlhacktut> yet, you might want
to do that first.

=head1 COMMON PROBLEMS

Perl source plays by ANSI C89 rules: no C99 (or C++) extensions. In
some cases we have to take pre-ANSI requirements into consideration.
You don't care about some particular platform having broken Perl? I
hear there is still a strong demand for J2EE programmers.

=head2 Perl environment problems

=over 4

=item *

Not compiling with threading

Compiling with threading (-Duseithreads) completely rewrites the
function prototypes of Perl. You better try your changes with that.
Related to this is the difference between "Perl_-less" and "Perl_-ly"
APIs, for example:

    Perl_sv_setiv(aTHX_ ...);
    sv_setiv(...);

The first one explicitly passes in the context, which is needed for
e.g. threaded builds. The second one does that implicitly; do not get
them mixed. If you are not passing in an aTHX_, you will need to do a
dTHX (or a dVAR) as the first thing in the function.

See L<perlguts/"How multiple interpreters and concurrency are
supported"> for further discussion about context.

=item *

Not compiling with -DDEBUGGING

The DEBUGGING define exposes more code to the compiler, therefore more
ways for things to go wrong. You should try it.

=item *

Introducing (non-read-only) globals

Do not introduce any modifiable globals, truly global or file static.
They are bad form and complicate multithreading and other forms of
concurrency. The right way is to introduce them as new interpreter
variables, see F<intrpvar.h> (at the very end for binary
compatibility).

Introducing read-only (const) globals is okay, as long as you verify
with e.g.
C (if your C has BSD-style output) that the data you added really is read-only. (If it is, it shouldn't show up in the output of that command.) If you want to have static strings, make them constant: static const char etc[] = "..."; If you want to have arrays of constant strings, note carefully the right combination of Cs: static const char * const yippee[] = {"hi", "ho", "silver"}; There is a way to completely hide any modifiable globals (they are all moved to heap), the compilation setting C<-DPERL_GLOBAL_STRUCT_PRIVATE>. It is not normally used, but can be used for testing, read more about it in L. =item * Not exporting your new function Some platforms (Win32, AIX, VMS, OS/2, to name a few) require any function that is part of the public API (the shared Perl library) to be explicitly marked as exported. See the discussion about F in L. =item * Exporting your new function The new shiny result of either genuine new functionality or your arduous refactoring is now ready and correctly exported. So what could possibly go wrong? Maybe simply that your function did not need to be exported in the first place. Perl has a long and not so glorious history of exporting functions that it should not have. If the function is used only inside one source code file, make it static. See the discussion about F in L. If the function is used across several files, but intended only for Perl's internal use (and this should be the common case), do not export it to the public API. See the discussion about F in L. =back =head2 Portability problems The following are common causes of compilation and/or execution failures, not common to Perl as such. The C FAQ is good bedtime reading. Please test your changes with as many C compilers and platforms as possible; we will, anyway, and it's nice to save oneself from public embarrassment. If using gcc, you can add the C<-std=c89> option which will hopefully catch most of these unportabilities. 
(However it might also catch incompatibilities in your system's header
files.)

Use the Configure C<-Dgccansipedantic> flag to enable the gcc C<-ansi
-pedantic> flags which enforce stricter ANSI rules.

If you use C<gcc -Wall>, note that not all the possible warnings (like
C<-Wuninitialized>) are given unless you also compile with C<-O>.

Note that if using gcc, starting from Perl 5.9.5 the Perl core source
code files (the ones at the top level of the source code distribution,
but not e.g. the extensions under ext/) are automatically compiled with
as many as possible of the C<-std=c89>, C<-ansi>, C<-pedantic>, and a
selection of C<-W> flags (see cflags.SH).

Also study L<perlport> carefully to avoid any bad assumptions about the
operating system, filesystems, and so forth.

You may once in a while try a "make microperl" to see whether we can
still compile Perl with just the bare minimum of interfaces. (See
README.micro.)

Do not assume an operating system indicates a certain compiler.

=over 4

=item *

Casting pointers to integers or casting integers to pointers

    void castaway(U8* p)
    {
      IV i = p;

or

    void castaway(U8* p)
    {
      IV i = (IV)p;

Both are bad, and broken, and unportable. Use the PTR2IV() macro that
does it right. (Likewise, there are PTR2UV(), PTR2NV(), INT2PTR(), and
NUM2PTR().)

=item *

Casting between function pointers and data pointers

Technically speaking, casting between function pointers and data
pointers is unportable and undefined. Practically speaking it seems to
work, but you should use the FPTR2DPTR() and DPTR2FPTR() macros.
Sometimes you can also play games with unions.

=item *

Assuming sizeof(int) == sizeof(long)

There are platforms where longs are 64 bits, and platforms where ints
are 64 bits, and while we are out to shock you, even platforms where
shorts are 64 bits. This is all legal according to the C standard. (In
other words, "long long" is not a portable way to specify 64 bits, and
"long long" is not even guaranteed to be any wider than "long".)
Instead, use the definitions IV, UV, IVSIZE, I32SIZE, and so forth.
Avoid things like I32 because they are B<not> guaranteed to be
I<exactly> 32 bits, they are I<at least> 32 bits, nor are they
guaranteed to be B<int> or B<long>. If you really explicitly need
64-bit variables, use I64 and U64, but only if guarded by HAS_QUAD.

=item *

Assuming one can dereference any type of pointer for any type of data

    char *p = ...;
    long pony = *(long *)p;    /* BAD */

Many platforms, quite rightly so, will give you a core dump instead of
a pony if the p happens not to be correctly aligned.

=item *

Lvalue casts

    (int)*p = ...;    /* BAD */

Simply not portable. Get your lvalue to be of the right type, or maybe
use temporary variables, or dirty tricks with unions.

=item *

Assume B<anything> about structs (especially the ones you don't
control, like the ones coming from the system headers)

=over 8

=item *

That a certain field exists in a struct

=item *

That no other fields exist besides the ones you know of

=item *

That a field is of certain signedness, sizeof, or type

=item *

That the fields are in a certain order

=over 8

=item *

While C guarantees the ordering specified in the struct definition,
between different platforms the definitions might differ

=back

=item *

That the sizeof(struct) or the alignments are the same everywhere

=over 8

=item *

There might be padding bytes between the fields to align the fields -
the bytes can be anything

=item *

Structs are required to be aligned to the maximum alignment required by
the fields - which for native types is usually equivalent to sizeof()
of the field

=back

=back

=item *

Assuming the character set is ASCIIish

Perl can compile and run under EBCDIC platforms. See L<perlebcdic>.
This is transparent for the most part, but because the character sets
differ, you shouldn't use numeric (decimal, octal, nor hex) constants
to refer to characters. You can safely say 'A', but not 0x41. You can
safely say '\n', but not \012.
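As a small illustration of the advice about character constants, here
is a standalone sketch (not Perl core code; the function names
is_letter_A and digit_value are hypothetical). It compares against
character literals instead of numeric codes, so the compiler supplies
the right value for whatever character set is in use. The digit helper
relies only on the C standard's guarantee that '0' through '9' are
consecutive.

```c
/* Hypothetical helpers for illustration; not part of the Perl API. */

/* Compare against the literal 'A' rather than 0x41, which is only
 * the right value on ASCII-based platforms. */
int is_letter_A(char c)
{
    return c == 'A';
}

/* Return the numeric value of a decimal digit character, or -1 if
 * c is not a digit. '0'..'9' are consecutive in every conforming
 * C implementation, so the subtraction is portable. */
int digit_value(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    return -1;
}
```

The same subtraction trick should not be applied to letters, since
nothing guarantees that alphabetic ranges are unbroken.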
If a character doesn't have a trivial input form, you can create a #define for it in both C and C, so that it resolves to different values depending on the character set being used. (There are three different EBCDIC character sets defined in C, so it might be best to insert the #define three times in that file.) Also, the range 'A' - 'Z' in ASCII is an unbroken sequence of 26 upper case alphabetic characters. That is not true in EBCDIC. Nor for 'a' to 'z'. But '0' - '9' is an unbroken range in both systems. Don't assume anything about other ranges. Many of the comments in the existing code ignore the possibility of EBCDIC, and may be wrong therefore, even if the code works. This is actually a tribute to the successful transparent insertion of being able to handle EBCDIC without having to change pre-existing code. UTF-8 and UTF-EBCDIC are two different encodings used to represent Unicode code points as sequences of bytes. Macros with the same names (but different definitions) in C and C are used to allow the calling code to think that there is only one such encoding. This is almost always referred to as C, but it means the EBCDIC version as well. Again, comments in the code may well be wrong even if the code itself is right. For example, the concept of C differs between ASCII and EBCDIC. On ASCII platforms, only characters that do not have the high-order bit set (i.e. whose ordinals are strict ASCII, 0 - 127) are invariant, and the documentation and comments in the code may assume that, often referring to something like, say, C. The situation differs and is not so simple on EBCDIC machines, but as long as the code itself uses the C macro appropriately, it works, even if the comments are wrong. =item * Assuming the character set is just ASCII ASCII is a 7 bit encoding, but bytes have 8 bits in them. The 128 extra characters have different meanings depending on the locale. 
Absent a locale, currently these extra characters are generally
considered to be unassigned, and this has presented some problems.
This is being changed starting in 5.12 so that these characters will be
considered to be Latin-1 (ISO-8859-1).

=item *

Mixing #define and #ifdef

    #define BURGLE(x) ... \
    #ifdef BURGLE_OLD_STYLE        /* BAD */
    ... do it the old way ... \
    #else
    ... do it the new way ... \
    #endif

You cannot portably "stack" cpp directives. For example in the above
you need two separate BURGLE() #defines, one for each #ifdef branch.

=item *

Adding non-comment stuff after #endif or #else

    #ifdef SNOSH
    ...
    #else !SNOSH    /* BAD */
    ...
    #endif SNOSH    /* BAD */

The #endif and #else cannot portably have anything non-comment after
them. If you want to document what is going on (which is a good idea
especially if the branches are long), use (C) comments:

    #ifdef SNOSH
    ...
    #else /* !SNOSH */
    ...
    #endif /* SNOSH */

The gcc option C<-Wendif-labels> warns about the bad variant (by
default on starting from Perl 5.9.4).

=item *

Having a comma after the last element of an enum list

    enum color {
      CERULEAN,
      CHARTREUSE,
      CINNABAR,     /* BAD */
    };

is not portable. Leave out the last comma.

Also note that whether enums are implicitly morphable to ints varies
between compilers, so you might need to cast to (int).

=item *

Using //-comments

    // This function bamfoodles the zorklator.   /* BAD */

That is C99 or C++. Perl is C89. Using the //-comments is silently
allowed by many C compilers but cranking up the ANSI C89 strictness
(which we like to do) causes the compilation to fail.

=item *

Mixing declarations and code

    void zorklator()
    {
      int n = 3;
      set_zorkmids(n);
      int q = 4;    /* BAD */

That is C99 or C++. Some C compilers allow that, but you shouldn't.

The gcc option C<-Wdeclaration-after-statement> scans for such
problems (by default on starting from Perl 5.9.4).

=item *

Introducing variables inside for()

    for(int i = ...; ...; ...) {    /* BAD */

That is C99 or C++.
While it would indeed be awfully nice to have that also in C89, to limit the scope of the loop variable, alas, we cannot. =item * Mixing signed char pointers with unsigned char pointers int foo(char *s) { ... } ... unsigned char *t = ...; /* Or U8* t = ... */ foo(t); /* BAD */ While this is legal practice, it is certainly dubious, and downright fatal in at least one platform: for example VMS cc considers this a fatal error. One cause for people often making this mistake is that a "naked char" and therefore dereferencing a "naked char pointer" have an undefined signedness: it depends on the compiler and the flags of the compiler and the underlying platform whether the result is signed or unsigned. For this very same reason using a 'char' as an array index is bad. =item * Macros that have string constants and their arguments as substrings of the string constants #define FOO(n) printf("number = %d\n", n) /* BAD */ FOO(10); Pre-ANSI semantics for that was equivalent to printf("10umber = %d\10"); which is probably not what you were expecting. Unfortunately at least one reasonably common and modern C compiler does "real backward compatibility" here, in AIX that is what still happens even though the rest of the AIX compiler is very happily C89. =item * Using printf formats for non-basic C types IV i = ...; printf("i = %d\n", i); /* BAD */ While this might by accident work in some platform (where IV happens to be an C), in general it cannot. IV might be something larger. Even worse the situation is with more specific types (defined by Perl's configuration step in F): Uid_t who = ...; printf("who = %d\n", who); /* BAD */ The problem here is that Uid_t might be not only not C-wide but it might also be unsigned, in which case large uids would be printed as negative values. 
There is no simple solution to this because of printf()'s limited intelligence, but for many types the right format is available as with either 'f' or '_f' suffix, for example: IVdf /* IV in decimal */ UVxf /* UV is hexadecimal */ printf("i = %"IVdf"\n", i); /* The IVdf is a string constant. */ Uid_t_f /* Uid_t in decimal */ printf("who = %"Uid_t_f"\n", who); Or you can try casting to a "wide enough" type: printf("i = %"IVdf"\n", (IV)something_very_small_and_signed); Also remember that the C<%p> format really does require a void pointer: U8* p = ...; printf("p = %p\n", (void*)p); The gcc option C<-Wformat> scans for such problems. =item * Blindly using variadic macros gcc has had them for a while with its own syntax, and C99 brought them with a standardized syntax. Don't use the former, and use the latter only if the HAS_C99_VARIADIC_MACROS is defined. =item * Blindly passing va_list Not all platforms support passing va_list to further varargs (stdarg) functions. The right thing to do is to copy the va_list using the Perl_va_copy() if the NEED_VA_COPY is defined. =item * Using gcc statement expressions val = ({...;...;...}); /* BAD */ While a nice extension, it's not portable. The Perl code does admittedly use them if available to gain some extra speed (essentially as a funky form of inlining), but you shouldn't. =item * Binding together several statements in a macro Use the macros STMT_START and STMT_END. STMT_START { ... } STMT_END =item * Testing for operating systems or versions when should be testing for features #ifdef __FOONIX__ /* BAD */ foo = quux(); #endif Unless you know with 100% certainty that quux() is only ever available for the "Foonix" operating system B that is available B correctly working for B past, present, B future versions of "Foonix", the above is very wrong. 
This is more correct (though still not perfect, because the below is a compile-time check): #ifdef HAS_QUUX foo = quux(); #endif How does the HAS_QUUX become defined where it needs to be? Well, if Foonix happens to be Unixy enough to be able to run the Configure script, and Configure has been taught about detecting and testing quux(), the HAS_QUUX will be correctly defined. In other platforms, the corresponding configuration step will hopefully do the same. In a pinch, if you cannot wait for Configure to be educated, or if you have a good hunch of where quux() might be available, you can temporarily try the following: #if (defined(__FOONIX__) || defined(__BARNIX__)) # define HAS_QUUX #endif ... #ifdef HAS_QUUX foo = quux(); #endif But in any case, try to keep the features and operating systems separate. =back =head2 Problematic System Interfaces =over 4 =item * malloc(0), realloc(0), calloc(0, 0) are non-portable. To be portable allocate at least one byte. (In general you should rarely need to work at this low level, but instead use the various malloc wrappers.) =item * snprintf() - the return type is unportable. Use my_snprintf() instead. =back =head2 Security problems Last but not least, here are various tips for safer coding. =over 4 =item * Do not use gets() Or we will publicly ridicule you. Seriously. =item * Do not use strcpy() or strcat() or strncpy() or strncat() Use my_strlcpy() and my_strlcat() instead: they either use the native implementation, or Perl's own implementation (borrowed from the public domain implementation of INN). =item * Do not use sprintf() or vsprintf() If you really want just plain byte strings, use my_snprintf() and my_vsnprintf() instead, which will try to use snprintf() and vsnprintf() if those safer APIs are available. If you want something fancier than a plain byte string, use SVs and Perl_sv_catpvf(). 
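To show why the bounded copy functions recommended above are safer than
strcpy(), here is a minimal standalone sketch of strlcpy-style
semantics. The function name bounded_copy is hypothetical and this is
not Perl's my_strlcpy() implementation; it only illustrates the
contract such functions are commonly documented to follow: never write
more than C<size> bytes, always NUL-terminate when C<size> is non-zero,
and return the length of the source so the caller can detect
truncation.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical illustration of strlcpy-style copying; not Perl's code. */
size_t bounded_copy(char *dst, const char *src, size_t size)
{
    size_t srclen;

    assert(src != NULL);
    srclen = strlen(src);

    if (size > 0) {
        /* Copy at most size - 1 bytes, leaving room for the NUL. */
        size_t n = (srclen < size - 1) ? srclen : size - 1;

        assert(dst != NULL);
        memcpy(dst, src, n);
        dst[n] = '\0';
    }

    /* A return value >= size tells the caller the copy was truncated. */
    return srclen;
}
```

A caller would typically compare the return value against the buffer
size and treat anything greater than or equal to it as a truncated
result, rather than silently carrying on.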
=back =head1 DEBUGGING You can compile a special debugging version of Perl, which allows you to use the C<-D> option of Perl to tell more about what Perl is doing. But sometimes there is no alternative than to dive in with a debugger, either to see the stack trace of a core dump (very useful in a bug report), or trying to figure out what went wrong before the core dump happened, or how did we end up having wrong or unexpected results. =head2 Poking at Perl To really poke around with Perl, you'll probably want to build Perl for debugging, like this: ./Configure -d -D optimize=-g make C<-g> is a flag to the C compiler to have it produce debugging information which will allow us to step through a running program, and to see in which C function we are at (without the debugging information we might see only the numerical addresses of the functions, which is not very helpful). F will also turn on the C compilation symbol which enables all the internal debugging code in Perl. There are a whole bunch of things you can debug with this: L lists them all, and the best way to find out about them is to play about with them. The most useful options are probably l Context (loop) stack processing t Trace execution o Method and overloading resolution c String/numeric conversions Some of the functionality of the debugging code can be achieved using XS modules. -Dr => use re 'debug' -Dx => use O 'Debug' =head2 Using a source-level debugger If the debugging output of C<-D> doesn't help you, it's time to step through perl's execution with a source-level debugger. =over 3 =item * We'll use C for our examples here; the principles will apply to any debugger (many vendors call their debugger C), but check the manual of the one you're using. =back To fire up the debugger, type gdb ./perl Or if you have a core dump: gdb ./perl core You'll want to do that in your Perl source tree so the debugger can read the source code. You should see the copyright message, followed by the prompt. 
(gdb) C will get you into the documentation, but here are the most useful commands: =over 3 =item * run [args] Run the program with the given arguments. =item * break function_name =item * break source.c:xxx Tells the debugger that we'll want to pause execution when we reach either the named function (but see L!) or the given line in the named source file. =item * step Steps through the program a line at a time. =item * next Steps through the program a line at a time, without descending into functions. =item * continue Run until the next breakpoint. =item * finish Run until the end of the current function, then stop again. =item * 'enter' Just pressing Enter will do the most recent operation again - it's a blessing when stepping through miles of source code. =item * print Execute the given C code and print its results. B: Perl makes heavy use of macros, and F does not necessarily support macros (see later L). You'll have to substitute them yourself, or to invoke cpp on the source code files (see L) So, for instance, you can't say print SvPV_nolen(sv) but you have to say print Perl_sv_2pv_nolen(sv) =back You may find it helpful to have a "macro dictionary", which you can produce by saying C. Even then, F won't recursively apply those macros for you. =head2 gdb macro support Recent versions of F have fairly good macro support, but in order to use it you'll need to compile perl with macro definitions included in the debugging information. Using F version 3.1, this means configuring with C<-Doptimize=-g3>. Other compilers might use a different switch (if they support debugging macros at all). =head2 Dumping Perl Data Structures One way to get around this macro hell is to use the dumping functions in F; these work a little like an internal L, but they also cover OPs and other structures that you can't get at from Perl. Let's take an example. We'll use the C<$a = $b + $c> we used before, but give it a bit of context: C<$b = "6XXXX"; $c = 2.3;>. 
Where's a good place to stop and poke around? What about C, the function we examined earlier to implement the C<+> operator: (gdb) break Perl_pp_add Breakpoint 1 at 0x46249f: file pp_hot.c, line 309. Notice we use C and not C - see L. With the breakpoint in place, we can run our program: (gdb) run -e '$b = "6XXXX"; $c = 2.3; $a = $b + $c' Lots of junk will go past as gdb reads in the relevant source files and libraries, and then: Breakpoint 1, Perl_pp_add () at pp_hot.c:309 309 dSP; dATARGET; tryAMAGICbin(add,opASSIGN); (gdb) step 311 dPOPTOPnnrl_ul; (gdb) We looked at this bit of code before, and we said that C arranges for two Cs to be placed into C and C - let's slightly expand it: #define dPOPTOPnnrl_ul NV right = POPn; \ SV *leftsv = TOPs; \ NV left = USE_LEFT(leftsv) ? SvNV(leftsv) : 0.0 C takes the SV from the top of the stack and obtains its NV either directly (if C is set) or by calling the C function. C takes the next SV from the top of the stack - yes, C uses C - but doesn't remove it. We then use C to get the NV from C in the same way as before - yes, C uses C. Since we don't have an NV for C<$b>, we'll have to use C to convert it. If we step again, we'll find ourselves there: Perl_sv_2nv (sv=0xa0675d0) at sv.c:1669 1669 if (!sv) (gdb) We can now use C to investigate the SV: SV = PV(0xa057cc0) at 0xa0675d0 REFCNT = 1 FLAGS = (POK,pPOK) PV = 0xa06a510 "6XXXX"\0 CUR = 5 LEN = 6 $1 = void We know we're going to get C<6> from this, so let's finish the subroutine: (gdb) finish Run till exit from #0 Perl_sv_2nv (sv=0xa0675d0) at sv.c:1671 0x462669 in Perl_pp_add () at pp_hot.c:311 311 dPOPTOPnnrl_ul; We can also dump out this op: the current op is always stored in C, and we can dump it with C. This'll give us similar output to L. 
{ 13 TYPE = add ===> 14 TARG = 1 FLAGS = (SCALAR,KIDS) { TYPE = null ===> (12) (was rv2sv) FLAGS = (SCALAR,KIDS) { 11 TYPE = gvsv ===> 12 FLAGS = (SCALAR) GV = main::b } } # finish this later # =head1 SOURCE CODE STATIC ANALYSIS Various tools exist for analysing C source code B, as opposed to B, that is, without executing the code. It is possible to detect resource leaks, undefined behaviour, type mismatches, portability problems, code paths that would cause illegal memory accesses, and other similar problems by just parsing the C code and looking at the resulting graph, what does it tell about the execution and data flows. As a matter of fact, this is exactly how C compilers know to give warnings about dubious code. =head2 lint, splint The good old C code quality inspector, C, is available in several platforms, but please be aware that there are several different implementations of it by different vendors, which means that the flags are not identical across different platforms. There is a lint variant called C (Secure Programming Lint) available from http://www.splint.org/ that should compile on any Unix-like platform. There are C and targets in Makefile, but you may have to diddle with the flags (see above). =head2 Coverity Coverity (http://www.coverity.com/) is a product similar to lint and as a testbed for their product they periodically check several open source projects, and they give out accounts to open source developers to the defect databases. =head2 cpd (cut-and-paste detector) The cpd tool detects cut-and-paste coding. If one instance of the cut-and-pasted code changes, all the other spots should probably be changed, too. Therefore such code should probably be turned into a subroutine or a macro. cpd (http://pmd.sourceforge.net/cpd.html) is part of the pmd project (http://pmd.sourceforge.net/). pmd was originally written for static analysis of Java code, but later the cpd part of it was extended to parse also C and C++. 
Download the pmd-bin-X.Y.zip () from the SourceForge site, extract the pmd-X.Y.jar from it, and then run that on source code thusly: java -cp pmd-X.Y.jar net.sourceforge.pmd.cpd.CPD --minimum-tokens 100 --files /some/where/src --language c > cpd.txt You may run into memory limits, in which case you should use the -Xmx option: java -Xmx512M ... =head2 gcc warnings Though much can be written about the inconsistency and coverage problems of gcc warnings (like C<-Wall> not meaning "all the warnings", or some common portability problems not being covered by C<-Wall>, or C<-ansi> and C<-pedantic> both being a poorly defined collection of warnings, and so forth), gcc is still a useful tool in keeping our coding nose clean. The C<-Wall> is by default on. The C<-ansi> (and its sidekick, C<-pedantic>) would be nice to be on always, but unfortunately they are not safe on all platforms, they can for example cause fatal conflicts with the system headers (Solaris being a prime example). If Configure C<-Dgccansipedantic> is used, the C frontend selects C<-ansi -pedantic> for the platforms where they are known to be safe. Starting from Perl 5.9.4 the following extra flags are added: =over 4 =item * C<-Wendif-labels> =item * C<-Wextra> =item * C<-Wdeclaration-after-statement> =back The following flags would be nice to have but they would first need their own Augean stablemaster: =over 4 =item * C<-Wpointer-arith> =item * C<-Wshadow> =item * C<-Wstrict-prototypes> =back The C<-Wtraditional> is another example of the annoying tendency of gcc to bundle a lot of warnings under one switch (it would be impossible to deploy in practice because it would complain a lot) but it does contain some warnings that would be beneficial to have available on their own, such as the warning about string constants inside macros containing the macro arguments: this behaved differently pre-ANSI than it does in ANSI, and some C compilers are still in transition, AIX being an example. 
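The macro-argument issue just mentioned can be seen in a small
standalone sketch (the macro and function names are hypothetical).
Under ANSI C rules a macro parameter that appears inside a string
constant is left alone, and the C<#> operator is the standard way to
turn an argument into a string; a pre-ANSI preprocessor would have
substituted inside the literal instead.

```c
#include <string.h>   /* strcmp(), for callers that check the results */

/* Under ANSI rules the n inside the string literal is NOT replaced,
 * so this always expands to the one-character string "n". */
#define LITERAL_N(n) "n"

/* The ANSI stringizing operator turns the argument text into a string. */
#define STRINGIZE(n) #n

/* Hypothetical wrappers so the expansions can be inspected at run time. */
const char *literal_n(void)
{
    return LITERAL_N(10);
}

const char *stringized(void)
{
    return STRINGIZE(10);
}
```

On an ANSI-conforming compiler literal_n() yields "n" and stringized()
yields "10"; code that depends on the pre-ANSI substitution is exactly
what the C<-Wtraditional> warning is meant to flag.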

=head2 Warnings of other C compilers

Other C compilers (yes, there B<are> other C compilers than gcc) often
have their "strict ANSI" or "strict ANSI with some portability
extensions" modes on, like for example the Sun Workshop has its C<-Xa>
mode on (though implicitly), or the DEC (these days, HP...) has its
C<-std1> mode on.

=head1 MEMORY DEBUGGERS

B<NOTE 1>: Running under memory debuggers such as Purify, valgrind, or
Third Degree greatly slows down the execution: seconds become minutes,
minutes become hours. For example as of Perl 5.8.1, the
ext/Encode/t/Unicode.t takes extraordinarily long to complete under
e.g. Purify, Third Degree, and valgrind. Under valgrind it takes more
than six hours, even on a snappy computer. The said test must be doing
something that is quite unfriendly for memory debuggers. If you don't
feel like waiting, you can simply kill the perl process.

B<NOTE 2>: To minimize the number of memory leak false alarms (see
L</PERL_DESTRUCT_LEVEL> for more information), you have to set the
environment variable PERL_DESTRUCT_LEVEL to 2.

For csh-like shells:

    setenv PERL_DESTRUCT_LEVEL 2

For Bourne-type shells:

    PERL_DESTRUCT_LEVEL=2
    export PERL_DESTRUCT_LEVEL

In Unixy environments you can also use the C<env> command:

    env PERL_DESTRUCT_LEVEL=2 valgrind ./perl -Ilib ...

B<NOTE 3>: There are known memory leaks when there are compile-time
errors within eval or require; seeing C<S_doeval> in the call stack is
a good sign of these. Fixing these leaks is non-trivial, unfortunately,
but they must be fixed eventually.

B<NOTE 4>: L<DynaLoader> will not clean up after itself completely
unless Perl is built with the Configure option
C<-Accflags=-DDL_UNLOAD_ALL_AT_EXIT>.

=head2 Rational Software's Purify

Purify is a commercial tool that is helpful in identifying memory
overruns, wild pointers, memory leaks and other such badness. Perl must
be compiled in a specific way for optimal testing with Purify. Purify
is available under Windows NT, Solaris, HP-UX, SGI, and Siemens Unix.

=head3 Purify on Unix

On Unix, Purify creates a new Perl binary.
To get the most benefit out of Purify, you should create the perl to Purify using: sh Configure -Accflags=-DPURIFY -Doptimize='-g' \ -Uusemymalloc -Dusemultiplicity where these arguments mean: =over 4 =item * -Accflags=-DPURIFY Disables Perl's arena memory allocation functions, as well as forcing use of memory allocation functions derived from the system malloc. =item * -Doptimize='-g' Adds debugging information so that you see the exact source statements where the problem occurs. Without this flag, all you will see is the source filename of where the error occurred. =item * -Uusemymalloc Disable Perl's malloc so that Purify can more closely monitor allocations and leaks. Using Perl's malloc will make Purify report most leaks in the "potential" leaks category. =item * -Dusemultiplicity Enabling the multiplicity option allows perl to clean up thoroughly when the interpreter shuts down, which reduces the number of bogus leak reports from Purify. =back Once you've compiled a perl suitable for Purify'ing, then you can just: make pureperl which creates a binary named 'pureperl' that has been Purify'ed. This binary is used in place of the standard 'perl' binary when you want to debug Perl memory problems. As an example, to show any memory leaks produced during the standard Perl testset you would create and run the Purify'ed perl as: make pureperl cd t ../pureperl -I../lib harness which would run Perl on test.pl and report any memory problems. Purify outputs messages in "Viewer" windows by default. If you don't have a windowing environment or if you simply want the Purify output to unobtrusively go to a log file instead of to the interactive window, use these following options to output to the log file "perl.log": setenv PURIFYOPTIONS "-chain-length=25 -windows=no \ -log-file=perl.log -append-logfile=yes" If you plan to use the "Viewer" windows, then you only need this option: setenv PURIFYOPTIONS "-chain-length=25" In Bourne-type shells: PURIFYOPTIONS="..." 
export PURIFYOPTIONS or if you have the "env" utility: env PURIFYOPTIONS="..." ../pureperl ... =head3 Purify on NT Purify on Windows NT instruments the Perl binary 'perl.exe' on the fly. There are several options in the makefile you should change to get the most use out of Purify: =over 4 =item * DEFINES You should add -DPURIFY to the DEFINES line so the DEFINES line looks something like: DEFINES = -DWIN32 -D_CONSOLE -DNO_STRICT $(CRYPT_FLAG) -DPURIFY=1 to disable Perl's arena memory allocation functions, as well as to force use of memory allocation functions derived from the system malloc. =item * USE_MULTI = define Enabling the multiplicity option allows perl to clean up thoroughly when the interpreter shuts down, which reduces the number of bogus leak reports from Purify. =item * #PERL_MALLOC = define Disable Perl's malloc so that Purify can more closely monitor allocations and leaks. Using Perl's malloc will make Purify report most leaks in the "potential" leaks category. =item * CFG = Debug Adds debugging information so that you see the exact source statements where the problem occurs. Without this flag, all you will see is the source filename of where the error occurred. =back As an example, to show any memory leaks produced during the standard Perl testset you would create and run Purify as: cd win32 make cd ../t purify ../perl -I../lib harness which would instrument Perl in memory, run Perl on test.pl, then finally report any memory problems. =head2 valgrind The excellent valgrind tool can be used to find out both memory leaks and illegal memory accesses. As of version 3.3.0, Valgrind only supports Linux on x86, x86-64 and PowerPC and Darwin (OS X) on x86 and x86-64). The special "test.valgrind" target can be used to run the tests under valgrind. Found errors and memory leaks are logged in files named F. 
Valgrind also provides a cachegrind tool, invoked on perl as: VG_OPTS=--tool=cachegrind make test.valgrind As system libraries (most notably glibc) are also triggering errors, valgrind allows to suppress such errors using suppression files. The default suppression file that comes with valgrind already catches a lot of them. Some additional suppressions are defined in F. To get valgrind and for more information see http://valgrind.org/ =head1 PROFILING Depending on your platform there are various ways of profiling Perl. There are two commonly used techniques of profiling executables: I and I. The first method takes periodically samples of the CPU program counter, and since the program counter can be correlated with the code generated for functions, we get a statistical view of in which functions the program is spending its time. The caveats are that very small/fast functions have lower probability of showing up in the profile, and that periodically interrupting the program (this is usually done rather frequently, in the scale of milliseconds) imposes an additional overhead that may skew the results. The first problem can be alleviated by running the code for longer (in general this is a good idea for profiling), the second problem is usually kept in guard by the profiling tools themselves. The second method divides up the generated code into I. Basic blocks are sections of code that are entered only in the beginning and exited only at the end. For example, a conditional jump starts a basic block. Basic block profiling usually works by I the code by adding I book-keeping code to the generated code. During the execution of the code the basic block counters are then updated appropriately. The caveat is that the added extra code can skew the results: again, the profiling tools usually try to factor their own effects out of the results. =head2 Gprof Profiling gprof is a profiling tool available in many Unix platforms, it uses F. 
You can build a profiled version of perl called "perl.gprof" by
invoking the make target "perl.gprof" (what is required is that Perl
must be compiled using the C<-pg> flag; you may need to re-Configure).
Running the profiled version of Perl will create an output file called
F<gmon.out>, which contains the profiling data collected during the
execution.

The gprof tool can then display the collected data in various ways.
Usually gprof understands the following options:

=over 4

=item * -a

Suppress statically defined functions from the profile.

=item * -b

Suppress the verbose descriptions in the profile.

=item * -e routine

Exclude the given routine and its descendants from the profile.

=item * -f routine

Display only the given routine and its descendants in the profile.

=item * -s

Generate a summary file called F<gmon.sum> which then may be given to
subsequent gprof runs to accumulate data over several runs.

=item * -z

Display routines that have zero usage.

=back

For more detailed explanation of the available commands and output
formats, see your own local documentation of gprof.

quick hint:

    $ sh Configure -des -Dusedevel -Doptimize='-pg' && make perl.gprof
    $ ./perl.gprof someprog # creates gmon.out in current directory
    $ gprof ./perl.gprof > out
    $ view out

=head2 GCC gcov Profiling

Starting from GCC 3.0 I<basic block profiling> is officially available
for the GNU CC.

You can build a profiled version of perl called F<perl.gcov> by
invoking the make target "perl.gcov" (what is required is that Perl
must be compiled using gcc with the flags C<-fprofile-arcs
-ftest-coverage>; you may need to re-Configure).

Running the profiled version of Perl will cause profile output to be
generated. For each source file an accompanying ".da" file will be
created.

To display the results you use the "gcov" utility (which should be
installed if you have gcc 3.0 or newer installed). F<gcov> is run on
source code files, like this

    gcov sv.c

which will cause F<sv.c.gcov> to be created.
The F<.gcov> files contain the source code annotated with relative frequencies of execution indicated by "#" markers. Useful options of F include C<-b> which will summarise the basic block, branch, and function call coverage, and C<-c> which instead of relative frequencies will use the actual counts. For more information on the use of F and basic block profiling with gcc, see the latest GNU CC manual, as of GCC 3.0 see http://gcc.gnu.org/onlinedocs/gcc-3.0/gcc.html and its section titled "8. gcov: a Test Coverage Program" http://gcc.gnu.org/onlinedocs/gcc-3.0/gcc_8.html#SEC132 quick hint: $ sh Configure -des -Dusedevel -Doptimize='-g' \ -Accflags='-fprofile-arcs -ftest-coverage' \ -Aldflags='-fprofile-arcs -ftest-coverage' && make perl.gcov $ rm -f regexec.c.gcov regexec.gcda $ ./perl.gcov $ gcov regexec.c $ view regexec.c.gcov =head1 MISCELLANEOUS TRICKS =head2 PERL_DESTRUCT_LEVEL If you want to run any of the tests yourself manually using e.g. valgrind, or the pureperl or perl.third executables, please note that by default perl B explicitly cleanup all the memory it has allocated (such as global memory arenas) but instead lets the exit() of the whole program "take care" of such allocations, also known as "global destruction of objects". There is a way to tell perl to do complete cleanup: set the environment variable PERL_DESTRUCT_LEVEL to a non-zero value. The t/TEST wrapper does set this to 2, and this is what you need to do too, if you don't want to see the "global leaks": For example, for "third-degreed" Perl: env PERL_DESTRUCT_LEVEL=2 ./perl.third -Ilib t/foo/bar.t (Note: the mod_perl apache module uses also this environment variable for its own purposes and extended its semantics. Refer to the mod_perl documentation for more information. Also, spawned threads do the equivalent of setting this variable to the value 1.) 
If, at the end of a run you get the message I, you can recompile with C<-DDEBUG_LEAKING_SCALARS>, which will cause the addresses of all those leaked SVs to be dumped along with details as to where each SV was originally allocated. This information is also displayed by Devel::Peek. Note that the extra details recorded with each SV increases memory usage, so it shouldn't be used in production environments. It also converts C from a macro into a real function, so you can use your favourite debugger to discover where those pesky SVs were allocated. If you see that you're leaking memory at runtime, but neither valgrind nor C<-DDEBUG_LEAKING_SCALARS> will find anything, you're probably leaking SVs that are still reachable and will be properly cleaned up during destruction of the interpreter. In such cases, using the C<-Dm> switch can point you to the source of the leak. If the executable was built with C<-DDEBUG_LEAKING_SCALARS>, C<-Dm> will output SV allocations in addition to memory allocations. Each SV allocation has a distinct serial number that will be written on creation and destruction of the SV. So if you're executing the leaking code in a loop, you need to look for SVs that are created, but never destroyed between each cycle. If such an SV is found, set a conditional breakpoint within C and make it break only when C is equal to the serial number of the leaking SV. Then you will catch the interpreter in exactly the state where the leaking SV is allocated, which is sufficient in many cases to find the source of the leak. As C<-Dm> is using the PerlIO layer for output, it will by itself allocate quite a bunch of SVs, which are hidden to avoid recursion. You can bypass the PerlIO layer if you use the SV logging provided by C<-DPERL_MEM_LOG> instead. =head2 PERL_MEM_LOG If compiled with C<-DPERL_MEM_LOG>, both memory and SV allocations go through logging functions, which is handy for breakpoint setting. 
Unless C<-DPERL_MEM_LOG_NOIMPL> is also compiled, the logging functions read $ENV{PERL_MEM_LOG} to determine whether to log the event, and if so how: $ENV{PERL_MEM_LOG} =~ /m/ Log all memory ops $ENV{PERL_MEM_LOG} =~ /s/ Log all SV ops $ENV{PERL_MEM_LOG} =~ /t/ include timestamp in Log $ENV{PERL_MEM_LOG} =~ /^(\d+)/ write to FD given (default is 2) Memory logging is somewhat similar to C<-Dm> but is independent of C<-DDEBUGGING>, and at a higher level; all uses of Newx(), Renew(), and Safefree() are logged with the caller's source code file and line number (and C function name, if supported by the C compiler). In contrast, C<-Dm> is directly at the point of C. SV logging is similar. Since the logging doesn't use PerlIO, all SV allocations are logged and no extra SV allocations are introduced by enabling the logging. If compiled with C<-DDEBUG_LEAKING_SCALARS>, the serial number for each SV allocation is also logged. =head2 DDD over gdb Those debugging perl with the DDD frontend over gdb may find the following useful: You can extend the data conversion shortcuts menu, so for example you can display an SV's IV value with one click, without doing any typing. To do that simply edit ~/.ddd/init file and add after: ! Display shortcuts. Ddd*gdbDisplayShortcuts: \ /t () // Convert to Bin\n\ /d () // Convert to Dec\n\ /x () // Convert to Hex\n\ /o () // Convert to Oct(\n\ the following two lines: ((XPV*) (())->sv_any )->xpv_pv // 2pvx\n\ ((XPVIV*) (())->sv_any )->xiv_iv // 2ivx so now you can do ivx and pvx lookups or you can plug there the sv_peek "conversion": Perl_sv_peek(my_perl, (SV*)()) // sv_peek (The my_perl is for threaded builds.) Just remember that every line, but the last one, should end with \n\ Alternatively edit the init file interactively via: 3rd mouse button -> New Display -> Edit Menu Note: you can define up to 20 conversion shortcuts in the gdb section. 
=head2 Poison If you see in a debugger a memory area mysteriously full of 0xABABABAB or 0xEFEFEFEF, you may be seeing the effect of the Poison() macros, see L. =head2 Read-only optrees Under ithreads the optree is read only. If you want to enforce this, to check for write accesses from buggy code, compile with C<-DPL_OP_SLAB_ALLOC> to enable the OP slab allocator and C<-DPERL_DEBUG_READONLY_OPS> to enable code that allocates op memory via C, and sets it read-only at run time. Any write access to an op results in a C and abort. This code is intended for development only, and may not be portable even to all Unix variants. Also, it is an 80% solution, in that it isn't able to make all ops read only. Specifically it =over =item * 1 Only sets read-only on all slabs of ops at C time, hence ops allocated later via C or C will be read-write =item * 2 Turns an entire slab of ops read-write if the refcount of any op in the slab needs to be decreased. =item * 3 Turns an entire slab of ops read-write if any op from the slab is freed. =back It's not possible to turn the slabs to read-only after an action requiring read-write access, as either can happen during op tree building time, so there may still be legitimate write access. However, as an 80% solution it is still effective, as currently it catches a write access during the generation of F, which means that we can't yet build F with this enabled. =head2 The .i Targets You can expand the macros in a F file by saying make foo.i which will expand the macros using cpp. Don't be scared by the results. =head1 AUTHOR This document was originally written by Nathan Torkington, and is maintained by the perl5-porters mailing list. =encoding utf8 =head1 NAME perl5121delta - what is new for perl v5.12.1 =head1 DESCRIPTION This document describes differences between the 5.12.0 release and the 5.12.1 release.
If you are upgrading from an earlier release such as 5.10.1, first read L, which describes differences between 5.10.1 and 5.12.0. =head1 Incompatible Changes There are no changes intentionally incompatible with 5.12.0. If any incompatibilities with 5.12.0 exist, they are bugs. Please report them. =head1 Core Enhancements Other than the bug fixes listed below, there should be no user-visible changes to the core language in this release. =head1 Modules and Pragmata =head2 Pragmata Changes =over =item * We fixed exporting of C and C from L. These were being exported with a wrapper that treated them as method calls, which caused them to fail. They are just functions, are documented as such, and should never be subclassed, so this patch just exports them directly as functions without the wrapper. =back =head2 Updated Modules =over =item * We upgraded L to version 3.49 to incorporate fixes for regressions introduced in the release we shipped with Perl 5.12.0. =item * We upgraded L to version 3.14 to get an improvement to \C\<\< \>\> parsing. =item * We made a small fix to the L test suite to fix an occasional spurious test failure. =item * We upgraded L to version 2.27 to wrap coderefs returned by C and C. =back =head1 Changes to Existing Documentation =over =item * We added the new maintenance release policy to L =item * We've clarified the multiple-angle-bracket construct in the spec for POD in L =item * We added a missing explanation for a warning about C<:=> to L =item * We removed a false claim in L that all text strings are Unicode strings in Perl. =item * We updated the Github mirror link in L to mirrors/perl, not github/perl =item * We fixed a minor error in L. =item * We replaced a mention of the now-obsolete L with F/F. =item * We improved documentation about F<$sitelibexp/sitecustomize.pl> in L. =item * We corrected L which had unintentionally omitted a number of modules.
=item * We updated the documentation for 'require' in L relating to putting Perl code in @INC. =item * We reinstated some erroneously-removed documentation about quotemeta in L. =item * We fixed an F example in L. =item * We filled in a blank in L with the release date of Perl 5.12. =item * We fixed broken links in a number of perldelta files. =item * The documentation for L incorrectly stated that the $Carp::Verbose variable makes cluck generate stack backtraces. =item * We fixed a number of typos in L =item * We improved documentation of case-changing functions in L =item * We corrected L to contain the correct version of the GNU General Public License. =back =head1 Testing =head2 Testing Improvements =over =item * F is now less prone to clock jitter during timing checks on Windows. sleep() time on Win32 may be rounded down to multiple of the clock tick interval. =item * F and F: Fixes for test failures on Darwin/PPC =item * F: Fix for test failures when C is installed. =back =head1 Installation and Configuration Improvements =head2 Configuration improvements =over =item * We updated F with notes about how to deal with broken F on OpenSUSE (and possibly other platforms) =back =head1 Bug Fixes =over 4 =item * A bug in how we process filetest operations could cause a segfault. Filetests don't always expect an op on the stack, so we now use TOPs only if we're sure that we're not stat'ing the _ filehandle. This is indicated by OPf_KIDS (as checked in ck_ftst). See also: L =item * When deparsing a nextstate op that has both a change of package (relative to the previous nextstate) and a label, the package declaration is now emitted first, because it is syntactically impermissible for a label to prefix a package declaration. =item * XSUB.h now correctly redefines fgets under PERL_IMPLICIT_SYS See also: L =item * utf8::is_utf8 now respects GMAGIC (e.g. $1) =item * XS code using C or C: on Windows could cause an error due to their arguments being swapped. 
See also: L =item * We fixed a small bug in lex_stuff_pvn() that caused spurious syntax errors in an obscure situation. It happened when stuffing was performed on the last line of a file and the line ended with a statement that lacked a terminating semicolon. See also: L =item * We fixed a bug that could cause \N{} constructs followed by a single . to be parsed incorrectly. See also: L =item * We fixed a bug that caused when(scalar) without an argument not to be treated as a syntax error. See also: L =item * We fixed a regression in the handling of labels immediately before string evals that was introduced in Perl 5.12.0. See also: L =item * We fixed a regression in case-insensitive matching of folded characters in regular expressions introduced in Perl 5.10.1. See also: L =back =head1 Platform Specific Notes =head2 HP-UX =over =item * Perl now allows -Duse64bitint without promoting to use64bitall on HP-UX =back =head2 AIX =over =item * Perl now builds on AIX 4.2 The changes required work around AIX 4.2's lack of support for IPv6, and limited support for POSIX C. =back =head2 FreeBSD 7 =over =item * FreeBSD 7 no longer contains F. At build time, Perl now skips the F check for versions 7 and higher and assumes ELF. =back =head2 VMS =over =item * It's now possible to build extensions on older (pre 7.3-2) VMS systems. DCL symbol length was limited to 1K up until about seven years or so ago, but there was no particularly deep reason to prevent those older systems from configuring and building Perl. =item * We fixed the previously-broken C<-Uuseperlio> build on VMS. We were checking a variable that doesn't exist in the non-default case of disabling perlio. Now we only look at it when it exists. =item * We fixed the -Uuseperlio command-line option in configure.com. Formerly it only worked if you went through all the questions interactively and explicitly answered no.
=back =head1 Known Problems =over =item * C misbehaves in the presence of a lexical C<$_> (typically introduced by C or implicitly by C). The variable which gets set for each iteration is the package variable C<$_>, not the lexical C<$_>. A similar issue may occur in other modules that provide functions which take a block as their first argument, like foo { ... $_ ...} list See also: L =item * C and C have an unfortunate interaction which can cause C to crash when it encounters an unparseable version string. Upgrading to C 0.9004 or C 0.38 from CPAN will resolve this issue. =back =head1 Acknowledgements Perl 5.12.1 represents approximately four weeks of development since Perl 5.12.0 and contains approximately 4,000 lines of changes across 142 files from 28 authors. Perl continues to flourish into its third decade thanks to a vibrant community of users and developers. The following people are known to have contributed the improvements that became Perl 5.12.1: Ævar Arnfjörð Bjarmason, Chris Williams, chromatic, Craig A. Berry, David Golden, Father Chrysostomos, Florian Ragwitz, Frank Wiegand, Gene Sullivan, Goro Fuji, H.Merijn Brand, James E Keenan, Jan Dubois, Jesse Vincent, Josh ben Jore, Karl Williamson, Leon Brocard, Michael Schwern, Nga Tang Chan, Nicholas Clark, Niko Tyni, Philippe Bruhat, Rafael Garcia-Suarez, Ricardo Signes, Steffen Mueller, Todd Rinaldo, Vincent Pit and Zefram. =head1 Reporting Bugs If you find what you think is a bug, you might check the articles recently posted to the comp.lang.perl.misc newsgroup and the perl bug database at http://rt.perl.org/perlbug/ . There may also be information at http://www.perl.org/ , the Perl Home Page. If you believe you have an unreported bug, please run the B program included with your release. Be sure to trim your bug down to a tiny but sufficient test case. Your bug report, along with the output of C, will be sent off to perlbug@perl.org to be analysed by the Perl porting team. 
If the bug you are reporting has security implications, which make it inappropriate to send to a publicly archived mailing list, then please send it to perl5-security-report@perl.org. This points to a closed subscription unarchived mailing list, which includes all the core committers, who will be able to help assess the impact of issues, figure out a resolution, and help co-ordinate the release of patches to mitigate or fix the problem across all platforms on which Perl is supported. Please only use this address for security issues in the Perl core, not for modules independently distributed on CPAN. =head1 SEE ALSO The F file for an explanation of how to view exhaustive details on what changed. The F file for how to build Perl. The F file for general stuff. The F and F files for copyright information. =cut If you read this file _as_is_, just ignore the funny characters you see. It is written in the POD format (see pod/perlpod.pod) which is specifically designed to be readable as is. =head1 NAME perlriscos - Perl version 5 for RISC OS =head1 DESCRIPTION This document gives instructions for building Perl for RISC OS. It is complicated by the need to cross compile. There is a binary version of perl available from L which you may wish to use instead of trying to compile it yourself. =head1 BUILD You need an installed and working gccsdk cross compiler L and REXEN L Firstly, copy the source and build a native copy of perl for your host system. Then, in the source to be cross compiled: =over 4 =item 1. $ ./Configure =item 2. Select the riscos hint file. The default answers for the rest of the questions are usually sufficient. Note that, if you wish to run Configure non-interactively (see the INSTALL document for details), to have it select the correct hint file, you'll need to provide the argument -Dhintfile=riscos on the Configure command-line. =item 3. $ make miniperl =item 4.
This should build miniperl and then fail when it tries to run it. =item 5. Copy the miniperl executable from the native build done earlier to replace the cross compiled miniperl. =item 6. $ make =item 7. This will use miniperl to complete the rest of the build. =back =head1 AUTHOR Alex Waugh =head1 NAME perlnumber - semantics of numbers and numeric operations in Perl =head1 SYNOPSIS $n = 1234; # decimal integer $n = 0b1110011; # binary integer $n = 01234; # octal integer $n = 0x1234; # hexadecimal integer $n = 12.34e-56; # exponential notation $n = "-12.34e56"; # number specified as a string $n = "1234"; # number specified as a string =head1 DESCRIPTION This document describes how Perl internally handles numeric values. Perl's operator overloading facility is completely ignored here. Operator overloading allows user-defined behaviors for numbers, such as operations over arbitrarily large integers, floating point numbers with arbitrary precision, operations over "exotic" numbers such as modular arithmetic or p-adic arithmetic, and so on. See L for details. =head1 Storing numbers Perl can internally represent numbers in 3 different ways: as native integers, as native floating point numbers, and as decimal strings. Decimal strings may have an exponential notation part, as in C<"12.34e-56">. I here means "a format supported by the C compiler which was used to build perl". The term "native" does not mean quite as much when we talk about native integers, as it does when native floating point numbers are involved. The only implication of the term "native" on integers is that the limits for the maximal and the minimal supported true integral quantities are close to powers of 2. However, "native" floats have a most fundamental restriction: they may represent only those numbers which have a relatively "short" representation when converted to a binary fraction.
For example, 0.9 cannot be represented by a native float, since the binary fraction for 0.9 is infinite: binary0.1110011001100... with the sequence C<1100> repeating again and again. In addition to this limitation, the exponent of the binary number is also restricted when it is represented as a floating point number. On typical hardware, floating point values can store numbers with up to 53 binary digits, and with binary exponents between -1024 and 1024. In decimal representation this is close to 16 decimal digits and decimal exponents in the range of -304..304. The upshot of all this is that Perl cannot store a number like 12345678901234567 as a floating point number on such architectures without loss of information. Similarly, decimal strings can represent only those numbers which have a finite decimal expansion. Being strings, and thus of arbitrary length, there is no practical limit for the exponent or number of decimal digits for these numbers. (But realize that we are discussing the rules for just the I of these numbers. The fact that you can store such "large" numbers does not mean that the I over these numbers will use all of the significant digits. See L<"Numeric operators and numeric conversions"> for details.) In fact numbers stored in the native integer format may be stored either in the signed native form, or in the unsigned native form. Thus the limits for Perl numbers stored as native integers would typically be -2**31..2**32-1, with appropriate modifications in the case of 64-bit integers. Again, this does not mean that Perl can do operations only over integers in this range: it is possible to store many more integers in floating point format. Summing up, Perl numeric values can store only those numbers which have a finite decimal expansion or a "short" binary expansion.
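The storage limits described above can be observed directly. The following is a minimal sketch, not part of the original manual page; the variable names are illustrative only, and the digits printed for the floating point cases depend on the floating point format of the perl build (a 53-bit mantissa is assumed in the comments).

```perl
use strict;
use warnings;

# 0.9 has an infinite binary expansion, so the stored value is only
# close to 0.9; asking for 20 decimal places exposes the difference.
my $nine_tenths = sprintf "%.20f", 0.9;

# 0.5 is exactly 2**-1, a "short" binary fraction, so it is stored exactly.
my $one_half = sprintf "%.20f", 0.5;

# Written with a decimal point, this literal is stored as a native float.
# Assuming 53 binary digits, the last decimal digit cannot be preserved.
my $as_float = sprintf "%.0f", 12345678901234567.0;

# The same digits kept as a decimal string are stored without loss.
my $as_string = "12345678901234567";

print "0.9 is stored as        $nine_tenths\n";
print "0.5 is stored as        $one_half\n";
print "17 digits as a float:   $as_float\n";
print "17 digits as a string:  $as_string\n";
```

Only the storage is shown here; as soon as the string is used in an arithmetic operation it is converted according to the rules in the next section.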
=head1 Numeric operators and numeric conversions As mentioned earlier, Perl can store a number in any one of three formats, but most operators typically understand only one of those formats. When a numeric value is passed as an argument to such an operator, it will be converted to the format understood by the operator. Six such conversions are possible: native integer --> native floating point (*) native integer --> decimal string native floating_point --> native integer (*) native floating_point --> decimal string (*) decimal string --> native integer decimal string --> native floating point (*) These conversions are governed by the following general rules: =over 4 =item * If the source number can be represented in the target form, that representation is used. =item * If the source number is outside of the limits representable in the target form, a representation of the closest limit is used. (I) =item * If the source number is between two numbers representable in the target form, a representation of one of these numbers is used. (I) =item * In C<< native floating point --> native integer >> conversions the magnitude of the result is less than or equal to the magnitude of the source. (I<"Rounding to zero".>) =item * If the C<< decimal string --> native integer >> conversion cannot be done without loss of information, the result is compatible with the conversion sequence C<< decimal_string --> native_floating_point --> native_integer >>. In particular, rounding is strongly biased to 0, though a number like C<"0.99999999999999999999"> has a chance of being rounded to 1. =back B: The conversions marked with C<(*)> above involve steps performed by the C compiler. In particular, bugs/features of the compiler used may lead to breakage of some of the above rules. 
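A few of these conversions can be sketched as follows. This example is an illustration added under stated assumptions, not text from the original page: it uses C<sprintf "%d"> as one convenient way to force a value into integer format, and the variable names are hypothetical.

```perl
use strict;
use warnings;

# native floating point --> native integer: the magnitude never grows
# ("rounding to zero"), for positive and negative values alike.
my $pos = sprintf "%d", 3.7;     # "3"
my $neg = sprintf "%d", -3.7;    # "-3"

# decimal string --> native integer: representable, so the value is kept.
my $from_string = "1234" + 0;    # 1234

# native integer --> decimal string: also lossless.
my $to_string = 1234 . "";       # "1234"

# decimal string --> native floating point: 12.5 has a short binary
# expansion (1100.1), so nothing is lost in this particular case.
my $float_from_string = "12.5" + 0;

print "$pos $neg $from_string $to_string $float_from_string\n";
```

The cases marked C<(*)> in the table are the ones where the result can depend on the C compiler, which is why this sketch sticks to values that are exactly representable wherever that matters.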
=head1 Flavors of Perl numeric operations Perl operations which take a numeric argument treat that argument in one of four different ways: they may force it to one of the integer/floating/ string formats, or they may behave differently depending on the format of the operand. Forcing a numeric value to a particular format does not change the number stored in the value. All the operators which need an argument in the integer format treat the argument as in modular arithmetic, e.g., C on a 32-bit architecture. C therefore provides the same result as C. =over 4 =item Arithmetic operators The binary operators C<+> C<-> C<*> C C<%> C<==> C C> C> C=> C=> and the unary operators C<-> C and C<--> will attempt to convert arguments to integers. If both conversions are possible without loss of precision, and the operation can be performed without loss of precision then the integer result is used. Otherwise arguments are converted to floating point format and the floating point result is used. The caching of conversions (as described above) means that the integer conversion does not throw away fractional parts on floating point numbers. =item ++ C<++> behaves as the other operators above, except that if it is a string matching the format C the string increment described in L is used. =item Arithmetic operators during C In scopes where C is in force, nearly all the operators listed above will force their argument(s) into integer format, and return an integer result. The exceptions, C, C<++> and C<-->, do not change their behavior with C =item Other mathematical operators Operators such as C<**>, C and C force arguments to floating point format. =item Bitwise operators Arguments are forced into the integer format if not strings. =item Bitwise operators during C forces arguments to integer format. Also shift operations internally use signed integers rather than the default unsigned. =item Operators which expect an integer force the argument into the integer format. 
This is applicable to the third and fourth arguments of C, for example. =item Operators which expect a string force the argument into the string format. For example, this is applicable to C. =back Though forcing an argument into a particular form does not change the stored number, Perl remembers the result of such conversions. In particular, though the first such conversion may be time-consuming, repeated operations will not need to redo the conversion. =head1 AUTHOR Ilya Zakharevich C Editorial adjustments by Gurusamy Sarathy Updates for 5.8.0 by Nicholas Clark =head1 SEE ALSO L, L =head1 NAME perlopentut - tutorial on opening things in Perl =head1 DESCRIPTION Perl has two simple, built-in ways to open files: the shell way for convenience, and the C way for precision. The shell way also has 2- and 3-argument forms, which have different semantics for handling the filename. The choice is yours. =head1 Open E<agrave> la shell Perl's C function was designed to mimic the way command-line redirection in the shell works. Here are some basic examples from the shell: $ myprogram file1 file2 file3 $ myprogram < inputfile $ myprogram > outputfile $ myprogram >> outputfile $ myprogram | otherprogram $ otherprogram | myprogram And here are some more advanced examples: $ otherprogram | myprogram f1 - f2 $ otherprogram 2>&1 | myprogram - $ myprogram <&3 $ myprogram >&4 Programmers accustomed to constructs like those above can take comfort in learning that Perl directly supports these familiar constructs using virtually the same syntax as the shell. =head2 Simple Opens The C function takes two arguments: the first is a filehandle, and the second is a single string comprising both what to open and how to open it. C returns true when it works, and when it fails, returns a false value and sets the special variable C<$!> to reflect the system error. If the filehandle was previously opened, it will be implicitly closed first.
For example: open(INFO, "datafile") || die("can't open datafile: $!"); open(INFO, "< datafile") || die("can't open datafile: $!"); open(RESULTS,"> runstats") || die("can't open runstats: $!"); open(LOG, ">> logfile ") || die("can't open logfile: $!"); If you prefer the low-punctuation version, you could write that this way: open INFO, "< datafile" or die "can't open datafile: $!"; open RESULTS,"> runstats" or die "can't open runstats: $!"; open LOG, ">> logfile " or die "can't open logfile: $!"; A few things to notice. First, the leading C<< < >> is optional. If omitted, Perl assumes that you want to open the file for reading. Note also that the first example uses the C<||> logical operator, and the second uses C, which has lower precedence. Using C<||> in the latter examples would effectively mean open INFO, ( "< datafile" || die "can't open datafile: $!" ); which is definitely not what you want. The other important thing to notice is that, just as in the shell, any whitespace before or after the filename is ignored. This is good, because you wouldn't want these to do different things: open INFO, "; # oops, \n still there open(EXTRA, "< $filename") || die "can't open $filename: $!"; This is not a bug, but a feature. Because C mimics the shell in its style of using redirection arrows to specify how to open the file, it also does so with respect to extra whitespace around the filename itself as well. For accessing files with naughty names, see L<"Dispelling the Dweomer">. There is also a 3-argument version of C, which lets you put the special redirection characters into their own argument: open( INFO, ">", $datafile ) || die "Can't create $datafile: $!"; In this case, the filename to open is the actual string in C<$datafile>, so you don't have to worry about C<$datafile> containing characters that might influence the open mode, or whitespace at the beginning of the filename that would be absorbed in the 2-argument version. 
Also, any reduction of unnecessary string interpolation is a good thing. =head2 Indirect Filehandles C's first argument can be a reference to a filehandle. As of perl 5.6.0, if the argument is uninitialized, Perl will automatically create a filehandle and put a reference to it in the first argument, like so: open( my $in, $infile ) or die "Couldn't read $infile: $!"; while ( <$in> ) { # do something with $_ } close $in; Indirect filehandles make namespace management easier. Since filehandles are global to the current package, two subroutines trying to open C will clash. With two functions opening indirect filehandles like C, there's no clash and no need to worry about future conflicts. Another convenient behavior is that an indirect filehandle automatically closes when there are no more references to it: sub firstline { open( my $in, shift ) && return scalar <$in>; # no close() required } Indirect filehandles also make it easy to pass filehandles to and return filehandles from subroutines: for my $file ( qw(this.conf that.conf) ) { my $fin = open_or_throw('<', $file); process_conf( $fin ); # no close() needed } use Carp; sub open_or_throw { my ($mode, $filename) = @_; open my $h, $mode, $filename or croak "Could not open '$filename': $!"; return $h; } =head2 Pipe Opens In C, when you want to open a file using the standard I/O library, you use the C function, but when opening a pipe, you use the C function. But in the shell, you just use a different redirection character. That's also the case for Perl. The C call remains the same--just its argument differs. If the leading character is a pipe symbol, C starts up a new command and opens a write-only filehandle leading into that command. This lets you write into that handle and have what you write show up on that command's standard input. 
For example: open(PRINTER, "| lpr -Plp1") || die "can't run lpr: $!"; print PRINTER "stuff\n"; close(PRINTER) || die "can't close lpr: $!"; If the trailing character is a pipe, you start up a new command and open a read-only filehandle leading out of that command. This lets whatever that command writes to its standard output show up on your handle for reading. For example: open(NET, "netstat -i -n |") || die "can't fork netstat: $!"; while () { } # do something with input close(NET) || die "can't close netstat: $!"; What happens if you try to open a pipe to or from a non-existent command? If possible, Perl will detect the failure and set C<$!> as usual. But if the command contains special shell characters, such as C> or C<*>, called 'metacharacters', Perl does not execute the command directly. Instead, Perl runs the shell, which then tries to run the command. This means that it's the shell that gets the error indication. In such a case, the C call will only indicate failure if Perl can't even run the shell. See L to see how to cope with this. There's also an explanation in L. If you would like to open a bidirectional pipe, the IPC::Open2 library will handle this for you. Check out L perl-5.6.x introduced a version of piped open that executes a process based on its command line arguments without relying on the shell. (Similar to the C notation.) This is safer and faster than executing a single argument pipe-command, but does not allow special shell constructs. (It is also not supported on Microsoft Windows, Mac OS Classic or RISC OS.) 

Here's an example of C<open '-|'>, which prints a random Unix fortune
cookie as uppercase:

    my $collection = shift(@ARGV);
    open my $fortune, '-|', 'fortune', $collection
        or die "Could not find fortune - $!";
    while (<$fortune>)
    {
        print uc($_);
    }
    close($fortune);

And this C<open '|-'> pipes into lpr:

    open my $printer, '|-', 'lpr', '-Plp1'
        or die "can't run lpr: $!";
    print {$printer} "stuff\n";
    close($printer)
        or die "can't close lpr: $!";

=head2 The Minus File

Again following the lead of the standard shell utilities, Perl's
C<open> function treats a file whose name is a single minus, "-", in
a special way. If you open minus for reading, it really means to
access the standard input. If you open minus for writing, it really
means to access the standard output.

If minus can be used as the default input or default output, what
happens if you open a pipe into or out of minus? What's the default
command it would run? The same script as you're currently running!
This is actually a stealth C<fork> hidden inside an C<open> call. See
L<perlipc/"Safe Pipe Opens"> for details.

=head2 Mixing Reads and Writes

It is possible to specify both read and write access. All you do is
add a "+" symbol in front of the redirection. But as in the shell,
using a less-than on a file never creates a new file; it only opens
an existing one. On the other hand, using a greater-than always
clobbers (truncates to zero length) an existing file, or creates a
brand-new one if there isn't an old one. Adding a "+" for read-write
doesn't affect whether it only works on existing files or always
clobbers existing ones.

    open(WTMP, "+< /usr/adm/wtmp")
        || die "can't open /usr/adm/wtmp: $!";

    open(SCREEN, "+> lkscreen")
        || die "can't open lkscreen: $!";

    open(LOGFILE, "+>> /var/log/applog")
        || die "can't open /var/log/applog: $!";

The first one won't create a new file, and the second one will always
clobber an old one.
The third one will create a new file if necessary and not clobber an
old one, and it will allow you to read at any point in the file, but
all writes will always go to the end. In short, the first case is
substantially more common than the second and third cases, which are
almost always wrong. (If you know C, the plus in Perl's C<open> is
historically derived from the one in C's fopen(3S), which it
ultimately calls.)

In fact, when it comes to updating a file, unless you're working on a
binary file as in the WTMP case above, you probably don't want to use
this approach for updating. Instead, Perl's B<-i> flag comes to the
rescue. The following command takes all the C, C++, or yacc source or
header files and changes all their foo's to bar's, leaving the old
version in the original filename with a ".orig" tacked on the end:

    $ perl -i.orig -pe 's/\bfoo\b/bar/g' *.[Cchy]

This is a shortcut for some renaming games that are really the best
way to update text files. See the second question in L<perlfaq5> for
more details.

=head2 Filters

One of the most common uses for C<open> is one you never even notice.
When you process the ARGV filehandle using C<< <ARGV> >>, Perl
actually does an implicit open on each file in @ARGV. Thus a program
called like this:

    $ myprogram file1 file2 file3

can have all its files opened and processed one at a time using a
construct no more complex than:

    while (<>) {
        # do something with $_
    }

If @ARGV is empty when the loop first begins, Perl pretends you've
opened up minus, that is, the standard input. In fact, $ARGV, the
currently open file during C<< <ARGV> >> processing, is even set to
"-" in these circumstances.

You are welcome to pre-process your @ARGV before starting the loop to
make sure it's to your liking. One reason to do this might be to
remove command options beginning with a minus.
While you can always roll the simple ones by hand, the Getopts
modules are good for this:

    use Getopt::Std;

    # -v, -D, -o ARG, sets $opt_v, $opt_D, $opt_o
    getopts("vDo:");

    # -v, -D, -o ARG, sets $args{v}, $args{D}, $args{o}
    getopts("vDo:", \%args);

Or the standard Getopt::Long module to permit named arguments:

    use Getopt::Long;
    GetOptions( "verbose"  => \$verbose,        # --verbose
                "Debug"    => \$debug,          # --Debug
                "output=s" => \$output );
            # --output=somestring or --output somestring

Another reason for preprocessing arguments is to make an empty
argument list default to all files:

    @ARGV = glob("*") unless @ARGV;

You could even filter out all but plain text files. This is a bit
silent, of course, and you might prefer to mention them on the way.

    @ARGV = grep { -f && -T } @ARGV;

If you're using the B<-n> or B<-p> command-line options, you should
put changes to @ARGV in a C<BEGIN{}> block.

Remember that a normal C<open> has special properties, in that it
might call fopen(3S) or it might call popen(3S), depending on what
its argument looks like; that's why it's sometimes called "magic
open". Here's an example:

    $pwdinfo = `domainname` =~ /^(\(none\))?$/
                    ? '< /etc/passwd'
                    : 'ypcat passwd |';

    open(PWD, $pwdinfo)
                or die "can't open $pwdinfo: $!";

This sort of thing also comes into play in filter processing. Because
C<< <ARGV> >> processing employs the normal, shell-style Perl C<open>,
it respects all the special things we've already seen:

    $ myprogram f1 "cmd1|" - f2 "cmd2|" f3 < tmpfile

That program will read from the file F<f1>, the process F<cmd1>,
standard input (F<tmpfile> in this case), the F<f2> file, the F<cmd2>
command, and finally the F<f3> file.

Yes, this also means that if you have files named "-" (and so on) in
your directory, they won't be processed as literal files by C<open>.
You'll need to pass them as "./-", much as you would for the I<rm>
program, or you could use C<sysopen> as described below.

One of the more interesting applications is to change files of a
certain name into pipes.
For example, to autoprocess gzipped or compressed files by
decompressing them with I<gzip>:

    @ARGV = map { /\.(gz|Z)$/ ? "gzip -dc $_ |" : $_  } @ARGV;

Or, if you have the I<GET> program installed from LWP, you can fetch
URLs before processing them:

    @ARGV = map { m#^\w+://# ? "GET $_ |" : $_ } @ARGV;

It's not for nothing that this is called magic C<< <ARGV> >>. Pretty
nifty, eh?

=head1 Open E<agrave> la C

If you want the convenience of the shell, then Perl's C<open> is
definitely the way to go. On the other hand, if you want finer
precision than C's simplistic fopen(3S) provides you should look to
Perl's C<sysopen>, which is a direct hook into the open(2) system
call. That does mean it's a bit more involved, but that's the price
of precision.

C<sysopen> takes 3 (or 4) arguments.

    sysopen HANDLE, PATH, FLAGS, [MASK]

The HANDLE argument is a filehandle just as with C<open>. The PATH is
a literal path, one that doesn't pay attention to any greater-thans
or less-thans or pipes or minuses, nor ignore whitespace. If it's
there, it's part of the path. The FLAGS argument contains one or more
values derived from the Fcntl module that have been or'd together
using the bitwise "|" operator. The final argument, the MASK, is
optional; if present, it is combined with the user's current umask
for the creation mode of the file. You should usually omit this.

Although the traditional values of read-only, write-only, and
read-write are 0, 1, and 2 respectively, this is known not to hold
true on some systems. Instead, it's best to load in the appropriate
constants first from the Fcntl module, which supplies the following
standard flags:

    O_RDONLY            Read only
    O_WRONLY            Write only
    O_RDWR              Read and write
    O_CREAT             Create the file if it doesn't exist
    O_EXCL              Fail if the file already exists
    O_APPEND            Append to the file
    O_TRUNC             Truncate the file
    O_NONBLOCK          Non-blocking access

Less common flags that are sometimes available on some operating
systems include C<O_BINARY>, C<O_TEXT>, C<O_SHLOCK>, C<O_EXLOCK>,
C<O_DEFER>, C<O_SYNC>, C<O_ASYNC>, C<O_DSYNC>, C<O_RSYNC>,
C<O_NOCTTY>, C<O_NDELAY> and C<O_LARGEFILE>.
Consult your open(2) manpage or its local equivalent for details.
(Note: starting from Perl release 5.6 the C<O_LARGEFILE> flag, if
available, is automatically added to the sysopen() flags because
large files are the default.)

Here's how to use C<sysopen> to emulate the simple C<open> calls we
had before. We'll omit the C<|| die $!> checks for clarity, but make
sure you always check the return values in real code. These aren't
quite the same, since C<open> will trim leading and trailing
whitespace, but you'll get the idea.

To open a file for reading:

    open(FH, "< $path");
    sysopen(FH, $path, O_RDONLY);

To open a file for writing, creating a new file if needed or else
truncating an old file:

    open(FH, "> $path");
    sysopen(FH, $path, O_WRONLY | O_TRUNC | O_CREAT);

To open a file for appending, creating one if necessary:

    open(FH, ">> $path");
    sysopen(FH, $path, O_WRONLY | O_APPEND | O_CREAT);

To open a file for update, where the file must already exist:

    open(FH, "+< $path");
    sysopen(FH, $path, O_RDWR);

And here are things you can do with C<sysopen> that you cannot do
with a regular C<open>. As you'll see, it's just a matter of
controlling the flags in the third argument.

To open a file for writing, creating a new file which must not
previously exist:

    sysopen(FH, $path, O_WRONLY | O_EXCL | O_CREAT);

To open a file for appending, where that file must already exist:

    sysopen(FH, $path, O_WRONLY | O_APPEND);

To open a file for update, creating a new file if necessary:

    sysopen(FH, $path, O_RDWR | O_CREAT);

To open a file for update, where that file must not already exist:

    sysopen(FH, $path, O_RDWR | O_EXCL | O_CREAT);

To open a file without blocking, creating one if necessary:

    sysopen(FH, $path, O_WRONLY | O_NONBLOCK | O_CREAT);

=head2 Permissions E<agrave> la mode

If you omit the MASK argument to C<sysopen>, Perl uses the octal
value 0666. The normal MASK to use for executables and directories
should be 0777, and for anything else, 0666.

Why so permissive? Well, it isn't really. The MASK will be modified
by your process's current C<umask>.
A umask is a number representing I<disabled> permissions bits; that
is, bits that will not be turned on in the created file's permissions
field.

For example, if your C<umask> were 027, then the 020 part would
disable the group from writing, and the 007 part would disable others
from reading, writing, or executing. Under these conditions, passing
C<sysopen> 0666 would create a file with mode 0640, since
C<0666 & ~027> is 0640.

You should seldom use the MASK argument to C<sysopen()>. That takes
away the user's freedom to choose what permission new files will
have. Denying choice is almost always a bad thing. One exception
would be for cases where sensitive or private data is being stored,
such as with mail folders, cookie files, and internal temporary
files.

=head1 Obscure Open Tricks

=head2 Re-Opening Files (dups)

Sometimes you already have a filehandle open, and want to make
another handle that's a duplicate of the first one. In the shell, we
place an ampersand in front of a file descriptor number when doing
redirections. For example, C<< 2>&1 >> makes descriptor 2 (that's
STDERR in Perl) be redirected into descriptor 1 (which is usually
Perl's STDOUT). The same is essentially true in Perl: a filename that
begins with an ampersand is treated instead as a file descriptor if a
number, or as a filehandle if a string.

    open(SAVEOUT, ">&SAVEERR") || die "couldn't dup SAVEERR: $!";
    open(MHCONTEXT, "<&4")     || die "couldn't dup fd4: $!";

That means that if a function is expecting a filename, but you don't
want to give it a filename because you already have the file open,
you can just pass the filehandle with a leading ampersand. It's best
to use a fully qualified handle though, just in case the function
happens to be in a different package:

    somefunction("&main::LOGFILE");

This way if somefunction() is planning on opening its argument, it
can just use the already opened handle. This differs from passing a
handle, because with a handle, you don't open the file. Here you have
something you can pass to open.
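
A common use of dups is to save STDOUT, point it somewhere else for a
while, and then put it back. The sketch below is illustrative rather
than part of the original text: it uses the three-argument form of
C<open> with a lexical handle, and the scratch file comes from
File::Temp only to keep the example self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file for the demonstration; removed when the program exits.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
close($tmp_fh);

# Duplicate STDOUT so the original destination can be restored later.
open(my $saved_out, ">&", \*STDOUT) or die "can't dup STDOUT: $!";

# Redirect STDOUT into the scratch file and write something.
open(STDOUT, ">", $tmp_name) or die "can't redirect STDOUT: $!";
print "captured line\n";

# Reopen STDOUT as a dup of the saved handle; this also flushes
# what was written to the scratch file.
open(STDOUT, ">&", $saved_out) or die "can't restore STDOUT: $!";
close($saved_out);

# Read back what went into the file while STDOUT was redirected.
open(my $in, "<", $tmp_name) or die "can't read $tmp_name: $!";
my $captured = <$in>;
close($in);

print "STDOUT restored; the file received: $captured";
```

The C<< >& >> mode asks for a new descriptor via dup(2), so closing
C<$saved_out> afterwards does not affect the restored STDOUT.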

If you have one of those tricky, newfangled I/O objects that the C++
folks are raving about, then this doesn't work because those aren't a
proper filehandle in the native Perl sense. You'll have to use
fileno() to pull out the proper descriptor number, assuming you can:

    use IO::Socket;
    $handle = IO::Socket::INET->new("www.perl.com:80");
    $fd = $handle->fileno;
    somefunction("&$fd");  # not an indirect function call

It can be easier (and certainly will be faster) just to use real
filehandles though:

    use IO::Socket;
    local *REMOTE = IO::Socket::INET->new("www.perl.com:80");
    die "can't connect" unless defined(fileno(REMOTE));
    somefunction("&main::REMOTE");

If the filehandle or descriptor number is preceded not just with a
simple "&" but rather with a "&=" combination, then Perl will not
create a completely new descriptor opened to the same place using the
dup(2) system call. Instead, it will just make something of an alias
to the existing one using the fdopen(3S) library call. This is
slightly more parsimonious of system resources, although this is less
of a concern these days. Here's an example of that:

    $fd = $ENV{"MHCONTEXTFD"};
    open(MHCONTEXT, "<&=$fd")   or die "couldn't fdopen $fd: $!";

If you're using magic C<< <ARGV> >>, you could even pass in as a
command line argument in @ARGV something like
C<< "<&=$MHCONTEXTFD" >>, but we've never seen anyone actually do
this.

=head2 Dispelling the Dweomer

Perl is more of a DWIMmer language than something like Java--where
DWIM is an acronym for "do what I mean". But this principle sometimes
leads to more hidden magic than one knows what to do with. In this
way, Perl is also filled with I<dweomer>, an obscure word meaning an
enchantment. Sometimes, Perl's DWIMmer is just too much like dweomer
for comfort.

If magic C<open> is a bit too magical for you, you don't have to turn
to C<sysopen>. To open a file with arbitrary weird characters in it,
it's necessary to protect any leading and trailing whitespace.
Leading whitespace is protected by inserting a C<"./"> in front of a
filename that starts with whitespace. Trailing whitespace is
protected by appending an ASCII NUL byte (C<"\0">) at the end of the
string.

    $file =~ s#^(\s)#./$1#;
    open(FH, "< $file\0")   || die "can't open $file: $!";

This assumes, of course, that your system considers dot the current
working directory, slash the directory separator, and disallows ASCII
NULs within a valid filename. Most systems follow these conventions,
including all POSIX systems as well as proprietary Microsoft systems.
The only vaguely popular system that doesn't work this way is the
"Classic" Macintosh system, which uses a colon where the rest of us
use a slash. Maybe C<sysopen> isn't such a bad idea after all.

If you want to use C<< <ARGV> >> processing in a totally boring and
non-magical way, you could do this first:

    #   "Sam sat on the ground and put his head in his hands.
    #   'I wish I had never come here, and I don't want to see
    #   no more magic,' he said, and fell silent."
    for (@ARGV) {
        s#^([^./])#./$1#;
        $_ .= "\0";
    }
    while (<>) {
        # now process $_
    }

But be warned that users will not appreciate being unable to use "-"
to mean standard input, per the standard convention.

=head2 Paths as Opens

You've probably noticed how Perl's C<warn> and C<die> functions can
produce messages like:

    Some warning at scriptname line 29, <FH> line 7.

That's because you opened a filehandle FH, and had read in seven
records from it. But what was the name of the file, rather than the
handle?

If you aren't running with C<strict refs>, or if you've turned them
off temporarily, then all you have to do is this:

    open($path, "< $path") || die "can't open $path: $!";
    while (<$path>) {
        # whatever
    }

Since you're using the pathname of the file as its handle, you'll get
warnings more like

    Some warning at scriptname line 29, </etc/motd> line 7.

=head2 Single Argument Open

Remember how we said that Perl's open took two arguments? That was a
passive prevarication. You see, it can also take just one argument.
If and only if the variable is a global variable, not a lexical, you
can pass C<open> just one argument, the filehandle, and it will get
the path from the global scalar variable of the same name.

    $FILE = "/etc/motd";
    open FILE or die "can't open $FILE: $!";
    while (<FILE>) {
        # whatever
    }

Why is this here? Someone has to cater to the hysterical porpoises.
It's something that's been in Perl since the very beginning, if not
before.

=head2 Playing with STDIN and STDOUT

One clever move with STDOUT is to explicitly close it when you're
done with the program.

    END { close(STDOUT) || die "can't close stdout: $!" }

If you don't do this, and your program fills up the disk partition
due to a command line redirection, it won't report the error or exit
with a failure status.

You don't have to accept the STDIN and STDOUT you were given. You are
welcome to reopen them if you'd like.

    open(STDIN, "< datafile")
        || die "can't open datafile: $!";

    open(STDOUT, "> output")
        || die "can't open output: $!";

And then these can be accessed directly or passed on to subprocesses.
This makes it look as though the program were initially invoked with
those redirections from the command line.

It's probably more interesting to connect these to pipes. For
example:

    $pager = $ENV{PAGER} || "(less || more)";
    open(STDOUT, "| $pager")
        || die "can't fork a pager: $!";

This makes it appear as though your program were called with its
stdout already piped into your pager. You can also use this kind of
thing in conjunction with an implicit fork to yourself. You might do
this if you would rather handle the post processing in your own
program, just in a different process:

    head(100);
    while (<>) {
        print;
    }

    sub head {
        my $lines = shift || 20;
        return if $pid = open(STDOUT, "|-");       # return if parent
        die "cannot fork: $!" unless defined $pid;
        while (<STDIN>) {
            last if --$lines < 0;
            print;
        }
        exit;
    }

This technique can be applied to repeatedly push as many filters on
your output stream as you wish.
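
As a further sketch of the same implicit-fork technique, written under
C<strict>, the filter logic can be passed in as a code reference. The
helper name C<push_filter> and the line-numbering callback are
hypothetical, introduced only for illustration, and the forking open
assumes a platform where C<open(STDOUT, "|-")> is supported.

```perl
use strict;
use warnings;

# Hypothetical helper: put a filter process in front of STDOUT.
# $transform receives each line and returns the text to print.
sub push_filter {
    my ($transform) = @_;
    my $pid = open(STDOUT, "|-");        # implicit fork
    die "cannot fork: $!" unless defined $pid;
    return if $pid;                      # parent: STDOUT now feeds the child
    while (my $line = <STDIN>) {         # child: reads what the parent prints
        print $transform->($line);       # child's STDOUT is the original one
    }
    exit;
}

# Number each line; $. is the child's input line counter for STDIN.
push_filter(sub { sprintf "%3d: %s", $., $_[0] });

print "alpha\n";
print "beta\n";

# Closing the pipe waits for the child and reports a failed filter.
close(STDOUT) or die "can't close filter: $!";
```

The explicit C<close(STDOUT)> at the end makes the parent wait for the
filter process, so the filtered lines are written before the program
exits.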

=head1 Other I/O Issues

These topics aren't really arguments related to C<open> or
C<sysopen>, but they do affect what you do with your open files.

=head2 Opening Non-File Files

When is a file not a file? Well, you could say when it exists but
isn't a plain file. We'll check whether it's a symbolic link first,
just in case.

    if (-l $file || ! -f _) {
        print "$file is not a plain file\n";
    }

What other kinds of files are there than, well, files? Directories,
symbolic links, named pipes, Unix-domain sockets, and block and
character devices. Those are all files, too--just not I<plain> files.
This isn't the same issue as being a text file. Not all text files
are plain files. Not all plain files are text files. That's why there
are separate C<-f> and C<-T> file tests.

To open a directory, you should use the C<opendir> function, then
process it with C<readdir>, carefully restoring the directory name if
necessary:

    opendir(DIR, $dirname) or die "can't opendir $dirname: $!";
    while (defined($file = readdir(DIR))) {
        # do something with "$dirname/$file"
    }
    closedir(DIR);

If you want to process directories recursively, it's better to use
the File::Find module. For example, this prints out all files
recursively and adds a slash to their names if the file is a
directory.

    @ARGV = qw(.) unless @ARGV;
    use File::Find;
    find sub { print $File::Find::name, -d && '/', "\n" }, @ARGV;

This finds all bogus symbolic links beneath a particular directory:

    find sub { print "$File::Find::name\n" if -l && !-e }, $dir;

As you see, with symbolic links, you can just pretend that it is what
it points to. Or, if you want to know I<what> it points to, then
C<readlink> is called for:

    if (-l $file) {
        if (defined($whither = readlink($file))) {
            print "$file points to $whither\n";
        } else {
            print "$file points nowhere: $!\n";
        }
    }

=head2 Opening Named Pipes

Named pipes are a different matter. You pretend they're regular
files, but their opens will normally block until there is both a
reader and a writer. You can read more about them in
L<perlipc/"Named Pipes">.
Unix-domain sockets are rather different beasts as well; they're
described in L<perlipc/"Unix-Domain TCP Clients and Servers">.

When it comes to opening devices, it can be easy and it can be
tricky. We'll assume that if you're opening up a block device, you
know what you're doing. The character devices are more interesting.
These are typically used for modems, mice, and some kinds of
printers. This is described in
L<perlfaq8/"How do I read and write the serial port?">. It's often
enough to open them carefully:

    sysopen(TTYIN, "/dev/ttyS1", O_RDWR | O_NDELAY | O_NOCTTY)
                # (O_NOCTTY no longer needed on POSIX systems)
        or die "can't open /dev/ttyS1: $!";
    open(TTYOUT, "+>&TTYIN")
        or die "can't dup TTYIN: $!";

    $ofh = select(TTYOUT); $| = 1; select($ofh);

    print TTYOUT "+++at\015";
    $answer = <TTYIN>;

With descriptors that you haven't opened using C<sysopen>, such as
sockets, you can set them to be non-blocking using C<fcntl>:

    use Fcntl;
    my $old_flags = fcntl($handle, F_GETFL, 0)
        or die "can't get flags: $!";
    fcntl($handle, F_SETFL, $old_flags | O_NONBLOCK)
        or die "can't set non blocking: $!";

Rather than losing yourself in a morass of twisting, turning
C<ioctl>s, all dissimilar, if you're going to manipulate ttys, it's
best to make calls out to the stty(1) program if you have it, or else
use the portable POSIX interface. To figure this all out, you'll need
to read the termios(3) manpage, which describes the POSIX interface
to tty devices, and then L<POSIX>, which describes Perl's interface
to POSIX. There are also some high-level modules on CPAN that can
help you with these games. Check out Term::ReadKey and
Term::ReadLine.

=head2 Opening Sockets

What else can you open? To open a connection using sockets, you won't
use one of Perl's two open functions. See
L<perlipc/"Sockets: Client/Server Communication"> for that. Here's an
example. Once you have it, you can use FH as a bidirectional
filehandle.

    use IO::Socket;
    local *FH = IO::Socket::INET->new("www.perl.com:80");

For opening up a URL, the LWP modules from CPAN are just what the
doctor ordered.
There's no filehandle interface, but it's still easy to get the
contents of a document:

    use LWP::Simple;
    $doc = get('http://www.cpan.org/');

=head2 Binary Files

On certain legacy systems with what could charitably be called
terminally convoluted (some would say broken) I/O models, a file
isn't a file--at least, not with respect to the C standard I/O
library. On these old systems whose libraries (but not kernels)
distinguish between text and binary streams, to get files to behave
properly you'll have to bend over backwards to avoid nasty problems.
On such infelicitous systems, sockets and pipes are already opened in
binary mode, and there is currently no way to turn that off. With
files, you have more options.

One option is to use the C<binmode> function on the appropriate
handles before doing regular I/O on them:

    binmode(STDIN);
    binmode(STDOUT);
    while (<STDIN>) { print }

Passing C<sysopen> a non-standard flag option will also open the file
in binary mode on those systems that support it. This is the
equivalent of opening the file normally, then calling C<binmode> on
the handle.

    sysopen(BINDAT, "records.data", O_RDWR | O_BINARY)
        || die "can't open records.data: $!";

Now you can use C<read> and C<print> on that handle without worrying
about the non-standard system I/O library breaking your data. It's
not a pretty picture, but then, legacy systems seldom are. CP/M will
be with us until the end of days, and after.

On systems with exotic I/O systems, it turns out that, astonishingly
enough, even unbuffered I/O using C<sysread> and C<syswrite> might do
sneaky data mutilation behind your back.

    while (sysread(WHENCE, $buf, 1024)) {
        syswrite(WHITHER, $buf, length($buf));
    }

Depending on the vicissitudes of your runtime system, even these
calls may need C<binmode> or C<O_BINARY> first. Systems known to be
free of such difficulties include Unix, the Mac OS, Plan 9, and
Inferno.

=head2 File Locking

In a multitasking environment, you may need to be careful not to
collide with other processes that want to do I/O on the same files as
you are working on.
You'll often need shared or exclusive locks on files for reading and
writing respectively. You might just pretend that only exclusive
locks exist.

Never use the existence of a file C<-e $file> as a locking
indication, because there is a race condition between the test for
the existence of the file and its creation. It's possible for another
process to create a file in the slice of time between your existence
check and your attempt to create the file. Atomicity is critical.

Perl's most portable locking interface is via the C<flock> function,
whose simplicity is emulated on systems that don't directly support
it such as SysV or Windows. The underlying semantics may affect how
it all works, so you should learn how C<flock> is implemented on your
system's port of Perl.

File locking I<does not> lock out another process that would like to
do I/O. A file lock only locks out others trying to get a lock, not
processes trying to do I/O. Because locks are advisory, if one
process uses locking and another doesn't, all bets are off.

By default, the C<flock> call will block until a lock is granted. A
request for a shared lock will be granted as soon as there is no
exclusive locker. A request for an exclusive lock will be granted as
soon as there is no locker of any kind. Locks are on file
descriptors, not file names. You can't lock a file until you open it,
and you can't hold on to a lock once the file has been closed.

Here's how to get a blocking shared lock on a file, typically used
for reading:

    use 5.004;
    use Fcntl qw(:DEFAULT :flock);
    open(FH, "< filename")  or die "can't open filename: $!";
    flock(FH, LOCK_SH)      or die "can't lock filename: $!";
    # now read from FH

You can get a non-blocking lock by using C<LOCK_NB>.

    flock(FH, LOCK_SH | LOCK_NB)
        or die "can't lock filename: $!";

This can be useful for producing more user-friendly behaviour by
warning if you're going to be blocking:

    use 5.004;
    use Fcntl qw(:DEFAULT :flock);
    open(FH, "< filename")  or die "can't open filename: $!";
    unless (flock(FH, LOCK_SH | LOCK_NB)) {
        $| = 1;
        print "Waiting for lock...";
        flock(FH, LOCK_SH)  or die "can't lock filename: $!";
        print "got it.\n"
    }
    # now read from FH

To get an exclusive lock, typically used for writing, you have to be
careful. We C<sysopen> the file so it can be locked before it gets
emptied. You can get a nonblocking version using
C<LOCK_EX | LOCK_NB>.

    use 5.004;
    use Fcntl qw(:DEFAULT :flock);
    sysopen(FH, "filename", O_WRONLY | O_CREAT)
        or die "can't open filename: $!";
    flock(FH, LOCK_EX)
        or die "can't lock filename: $!";
    truncate(FH, 0)
        or die "can't truncate filename: $!";
    # now write to FH

Finally, due to the uncounted millions who cannot be dissuaded from
wasting cycles on useless vanity devices called hit counters, here's
how to increment a number in a file safely:

    use Fcntl qw(:DEFAULT :flock);

    sysopen(FH, "numfile", O_RDWR | O_CREAT)
        or die "can't open numfile: $!";
    # autoflush FH
    $ofh = select(FH); $| = 1; select ($ofh);
    flock(FH, LOCK_EX)
        or die "can't write-lock numfile: $!";

    $num = <FH> || 0;
    seek(FH, 0, 0)
        or die "can't rewind numfile : $!";
    print FH $num+1, "\n"
        or die "can't write numfile: $!";

    truncate(FH, tell(FH))
        or die "can't truncate numfile: $!";
    close(FH)
        or die "can't close numfile: $!";

=head2 IO Layers

In Perl 5.8.0 a new I/O framework called "PerlIO" was introduced.
This is a new "plumbing" for all the I/O happening in Perl; for the
most part everything will work just as it did, but PerlIO also
brought in some new features such as the ability to think of I/O as
"layers". One I/O layer may in addition to just moving the data also
do transformations on the data.
Such transformations may include compression and decompression,
encryption and decryption, and transforming between various character
encodings.

Full discussion about the features of PerlIO is out of scope for this
tutorial, but here is how to recognize the layers being used:

=over 4

=item *

The three-(or more)-argument form of C<open> is being used and the
second argument contains something else in addition to the usual
C<< '<' >>, C<< '>' >>, C<< '>>' >>, C<< '|' >> and their variants,
for example:

    open(my $fh, "<:crlf", $fn);

=item *

The two-argument form of C<binmode> is being used, for example

    binmode($fh, ":encoding(utf16)");

=back

For more detailed discussion about PerlIO see L<PerlIO>; for more
detailed discussion about Unicode and I/O see L<perluniintro>.

=head1 SEE ALSO

The C<open> and C<sysopen> functions in perlfunc(1); the system
open(2), dup(2), fopen(3), and fdopen(3) manpages; the POSIX
documentation.

=head1 AUTHOR and COPYRIGHT

Copyright 1998 Tom Christiansen.

This documentation is free; you can redistribute it and/or modify it
under the same terms as Perl itself.

Irrespective of its distribution, all code examples in these files
are hereby placed into the public domain. You are permitted and
encouraged to use this code in your own programs for fun or for
profit as you see fit. A simple comment in the code giving credit
would be courteous but is not required.

=head1 HISTORY

First release: Sat Jan  9 08:09:11 MST 1999

=head1 NAME

perlfork - Perl's fork() emulation

=head1 SYNOPSIS

    NOTE:  As of the 5.8.0 release, fork() emulation has considerably
    matured.  However, there are still a few known bugs and differences
    from real fork() that might affect you.  See the "BUGS" and
    "CAVEATS AND LIMITATIONS" sections below.

Perl provides a fork() keyword that corresponds to the Unix system
call of the same name. On most Unix-like platforms where the fork()
system call is available, Perl's fork() simply calls it.
On some platforms such as Windows where the fork() system call is not
available, Perl can be built to emulate fork() at the interpreter
level. While the emulation is designed to be as compatible as
possible with the real fork() at the level of the Perl program, there
are certain important differences that stem from the fact that all
the pseudo child "processes" created this way live in the same real
process as far as the operating system is concerned.

This document provides a general overview of the capabilities and
limitations of the fork() emulation. Note that the issues discussed
here are not applicable to platforms where a real fork() is available
and Perl has been configured to use it.

=head1 DESCRIPTION

The fork() emulation is implemented at the level of the Perl
interpreter. What this means in general is that running fork() will
actually clone the running interpreter and all its state, and run the
cloned interpreter in a separate thread, beginning execution in the
new thread just after the point where the fork() was called in the
parent. We will refer to the thread that implements this child
"process" as the pseudo-process.

To the Perl program that called fork(), all this is designed to be
transparent. The parent returns from the fork() with a pseudo-process
ID that can be subsequently used in any process-manipulation
functions; the child returns from the fork() with a value of C<0> to
signify that it is the child pseudo-process.

=head2 Behavior of other Perl features in forked pseudo-processes

Most Perl features behave in a natural way within pseudo-processes.

=over 8

=item $$ or $PROCESS_ID

This special variable is correctly set to the pseudo-process ID. It
can be used to identify pseudo-processes within a particular session.
Note that this value is subject to recycling if any pseudo-processes
are launched after others have been wait()-ed on.

=item %ENV

Each pseudo-process maintains its own virtual environment.
Modifications to %ENV affect the virtual environment, and are only
visible within that pseudo-process, and in any processes (or
pseudo-processes) launched from it.

=item chdir() and all other builtins that accept filenames

Each pseudo-process maintains its own virtual idea of the current
directory. Modifications to the current directory using chdir() are
only visible within that pseudo-process, and in any processes (or
pseudo-processes) launched from it. All file and directory accesses
from the pseudo-process will correctly map the virtual working
directory to the real working directory.

=item wait() and waitpid()

wait() and waitpid() can be passed a pseudo-process ID returned by
fork(). These calls will properly wait for the termination of the
pseudo-process and return its status.

=item kill()

C<kill('KILL', ...)> can be used to terminate a pseudo-process by
passing it the ID returned by fork(). The outcome of kill on a
pseudo-process is unpredictable and it should not be used except
under dire circumstances, because the operating system may not
guarantee integrity of the process resources when a running thread is
terminated. The process which implements the pseudo-processes can be
blocked and the Perl interpreter hangs. Note that using
C<kill('KILL', ...)> on a pseudo-process may typically cause memory
leaks, because the thread that implements the pseudo-process does not
get a chance to clean up its resources.

C<kill('TERM', ...)> can also be used on pseudo-processes, but the
signal will not be delivered while the pseudo-process is blocked by a
system call, e.g. waiting for a socket to connect, or trying to read
from a socket with no data available. Starting in Perl 5.14 the
parent process will not wait for children to exit once they have been
signalled with C<kill('TERM', ...)> to avoid deadlock during process
exit. You will have to explicitly call waitpid() to make sure the
child has time to clean itself up, but you are then also responsible
for making sure that the child is not blocking on I/O.
=item exec() Calling exec() within a pseudo-process actually spawns the requested executable in a separate process and waits for it to complete before exiting with the same exit status as that process. This means that the process ID reported within the running executable will be different from what the earlier Perl fork() might have returned. Similarly, any process manipulation functions applied to the ID returned by fork() will affect the waiting pseudo-process that called exec(), not the real process it is waiting for after the exec(). When exec() is called inside a pseudo-process then DESTROY methods and END blocks will still be called after the external process returns. =item exit() exit() always exits just the executing pseudo-process, after automatically wait()-ing for any outstanding child pseudo-processes. Note that this means that the process as a whole will not exit unless all running pseudo-processes have exited. See below for some limitations with open filehandles. =item Open handles to files, directories and network sockets All open handles are dup()-ed in pseudo-processes, so that closing any handles in one process does not affect the others. See below for some limitations. =back =head2 Resource limits In the eyes of the operating system, pseudo-processes created via the fork() emulation are simply threads in the same process. This means that any process-level limits imposed by the operating system apply to all pseudo-processes taken together. This includes any limits imposed by the operating system on the number of open file, directory and socket handles, limits on disk space usage, limits on memory size, limits on CPU utilization etc. =head2 Killing the parent process If the parent process is killed (either using Perl's kill() builtin, or using some external means) all the pseudo-processes are killed as well, and the whole process exits. 
=head2 Lifetime of the parent process and pseudo-processes

During the normal course of events, the parent process and every
pseudo-process started by it will wait for their respective pseudo-children
to complete before they exit.  This means that the parent and every
pseudo-child created by it that is also a pseudo-parent will only exit
after their pseudo-children have exited.

Starting with Perl 5.14 a parent will not wait() automatically
for any child that has been signalled with C<kill('TERM', ...)>
to avoid a deadlock in case the child is blocking on I/O and
never receives the signal.

=head1 CAVEATS AND LIMITATIONS

=over 8

=item BEGIN blocks

The fork() emulation will not work entirely correctly when called from
within a BEGIN block.  The forked copy will run the contents of the
BEGIN block, but will not continue parsing the source stream after the
BEGIN block.  For example, consider the following code:

    BEGIN {
        fork and exit;          # fork child and exit the parent
        print "inner\n";
    }
    print "outer\n";

This will print:

    inner

rather than the expected:

    inner
    outer

This limitation arises from fundamental technical difficulties in
cloning and restarting the stacks used by the Perl parser in the
middle of a parse.

=item Open filehandles

Any filehandles open at the time of the fork() will be dup()-ed.  Thus,
the files can be closed independently in the parent and child, but beware
that the dup()-ed handles will still share the same seek pointer.  Changing
the seek position in the parent will change it in the child and vice-versa.
One can avoid this by opening files that need distinct seek pointers
separately in the child.

On some operating systems, notably Solaris and Unixware, calling C<exit()>
from a child process will flush and close open filehandles in the parent,
thereby corrupting the filehandles.  On these systems, calling C<_exit()>
is suggested instead.  C<_exit()> is available in Perl through the
C<POSIX> module.  Please consult your system's manpages for more information
on this.
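The shared seek pointer described under "Open filehandles" is ordinary
dup() behaviour and can be observed outside Perl.  The following C sketch is
not part of the fork() emulation; it assumes a POSIX system, and the function
name C<offset_seen_through_dup> is made up for this illustration.  It
duplicates a descriptor, seeks on the original, and reports the offset seen
through the duplicate.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: seek on one descriptor and return the file
 * offset observed through a dup()-ed copy of it, or -1 on failure. */
off_t offset_seen_through_dup(off_t pos)
{
    assert(pos >= 0);

    FILE *fp = tmpfile();            /* anonymous scratch file */
    if (fp == NULL)
        return (off_t)-1;

    int fd   = fileno(fp);
    int copy = dup(fd);              /* second descriptor, same open file */
    if (copy < 0) {
        fclose(fp);
        return (off_t)-1;
    }

    off_t seen = (off_t)-1;
    if (lseek(fd, pos, SEEK_SET) == pos)     /* move the "parent" handle */
        seen = lseek(copy, 0, SEEK_CUR);     /* query the "child" handle */

    close(copy);                     /* closing the copy leaves fd open */
    fclose(fp);
    return seen;
}
```

Calling C<offset_seen_through_dup(123)> should return 123 on such a system:
the duplicate sees the position set through the original, which is why
separately opened files are needed when the parent and child must seek
independently.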
=item Open directory handles

Perl will completely read from all open directory handles until they
reach the end of the stream.  It will then seekdir() back to the
original location and all future readdir() requests will be fulfilled
from the cache buffer.  That means that neither the directory handle held
by the parent process nor the one held by the child process will see
any changes made to the directory after the fork() call.

Note that rewinddir() has a similar limitation on Windows and will not
force readdir() to read the directory again either.  Only a newly
opened directory handle will reflect changes to the directory.

=item Forking pipe open() not yet implemented

The C<open(FOO, "|-")> and C<open(BAR, "-|")> constructs are not yet
implemented.  This limitation can be easily worked around in new code
by creating a pipe explicitly.  The following example shows how to
write to a forked child:

    # simulate open(FOO, "|-")
    sub pipe_to_fork ($) {
        my $parent = shift;
        pipe my $child, $parent or die;
        my $pid = fork();
        die "fork() failed: $!" unless defined $pid;
        if ($pid) {
            close $child;
        }
        else {
            close $parent;
            open(STDIN, "<&=" . fileno($child)) or die;
        }
        $pid;
    }

    if (pipe_to_fork('FOO')) {
        # parent
        print FOO "pipe_to_fork\n";
        close FOO;
    }
    else {
        # child
        while (<STDIN>) { print; }
        exit(0);
    }

And this one reads from the child:

    # simulate open(FOO, "-|")
    sub pipe_from_fork ($) {
        my $parent = shift;
        pipe $parent, my $child or die;
        my $pid = fork();
        die "fork() failed: $!" unless defined $pid;
        if ($pid) {
            close $child;
        }
        else {
            close $parent;
            open(STDOUT, ">&=" . fileno($child)) or die;
        }
        $pid;
    }

    if (pipe_from_fork('BAR')) {
        # parent
        while (<BAR>) { print; }
        close BAR;
    }
    else {
        # child
        print "pipe_from_fork\n";
        exit(0);
    }

Forking pipe open() constructs will be supported in future.

=item Global state maintained by XSUBs

External subroutines (XSUBs) that maintain their own global state may
not work correctly.
Such XSUBs will either need to maintain locks to protect simultaneous access to global data from different pseudo-processes, or maintain all their state on the Perl symbol table, which is copied naturally when fork() is called. A callback mechanism that provides extensions an opportunity to clone their state will be provided in the near future. =item Interpreter embedded in larger application The fork() emulation may not behave as expected when it is executed in an application which embeds a Perl interpreter and calls Perl APIs that can evaluate bits of Perl code. This stems from the fact that the emulation only has knowledge about the Perl interpreter's own data structures and knows nothing about the containing application's state. For example, any state carried on the application's own call stack is out of reach. =item Thread-safety of extensions Since the fork() emulation runs code in multiple threads, extensions calling into non-thread-safe libraries may not work reliably when calling fork(). As Perl's threading support gradually becomes more widely adopted even on platforms with a native fork(), such extensions are expected to be fixed for thread-safety. =back =head1 PORTABILITY CAVEATS In portable Perl code, C must not be used on forked processes. Killing a forked process is unsafe and has unpredictable results. See L, above. =head1 BUGS =over 8 =item * Having pseudo-process IDs be negative integers breaks down for the integer C<-1> because the wait() and waitpid() functions treat this number as being special. The tacit assumption in the current implementation is that the system never allocates a thread ID of C<1> for user threads. A better representation for pseudo-process IDs will be implemented in future. =item * In certain cases, the OS-level handles created by the pipe(), socket(), and accept() operators are apparently not duplicated accurately in pseudo-processes. 
This only happens in some situations, but where it does happen, it may
result in deadlocks between the read and write ends of pipe handles, or
inability to send or receive data across socket handles.

=item *

This document may be incomplete in some respects.

=back

=head1 AUTHOR

Support for concurrent interpreters and the fork() emulation was implemented
by ActiveState, with funding from Microsoft Corporation.

This document is authored and maintained by Gurusamy Sarathy
E<lt>gsar@activestate.comE<gt>.

=head1 SEE ALSO

L<perlfunc/"fork">, L<perlipc>

=cut

=head1 NAME

perlguts - Introduction to the Perl API

=head1 DESCRIPTION

This document attempts to describe how to use the Perl API, as well as
to provide some info on the basic workings of the Perl core.  It is far
from complete and probably contains many errors.  Please refer any
questions or comments to the author below.

=head1 Variables

=head2 Datatypes

Perl has three typedefs that handle Perl's three main data types:

    SV  Scalar Value
    AV  Array Value
    HV  Hash Value

Each typedef has specific routines that manipulate the various data types.

=head2 What is an "IV"?

Perl uses a special typedef IV which is a simple signed integer type that is
guaranteed to be large enough to hold a pointer (as well as an integer).
Additionally, there is the UV, which is simply an unsigned IV.

Perl also uses two special typedefs, I32 and I16, which will always be at
least 32 bits and 16 bits long, respectively.  (Again, there are U32 and U16,
as well.)  They will usually be exactly 32 and 16 bits long, but on Crays
they will both be 64 bits.

=head2 Working with SVs

An SV can be created and loaded with one command.  There are five types of
values that can be loaded: an integer value (IV), an unsigned integer value
(UV), a double (NV), a string (PV), and another scalar (SV).
The seven routines are:

    SV*  newSViv(IV);
    SV*  newSVuv(UV);
    SV*  newSVnv(double);
    SV*  newSVpv(const char*, STRLEN);
    SV*  newSVpvn(const char*, STRLEN);
    SV*  newSVpvf(const char*, ...);
    SV*  newSVsv(SV*);

C<STRLEN> is an integer type (Size_t, usually defined as size_t in
F<config.h>) guaranteed to be large enough to represent the size of
any string that perl can handle.

In the unlikely case of a SV requiring more complex initialisation, you
can create an empty SV with newSV(len).  If C<len> is 0 an empty SV of
type NULL is returned, else an SV of type PV is returned with len + 1 (for
the NUL) bytes of storage allocated, accessible via SvPVX.  In both cases
the SV has the undef value.

    SV *sv = newSV(0);   /* no storage allocated  */
    SV *sv = newSV(10);  /* 10 (+1) bytes of uninitialised storage
                          * allocated */

To change the value of an I<already-existing> SV, there are eight routines:

    void  sv_setiv(SV*, IV);
    void  sv_setuv(SV*, UV);
    void  sv_setnv(SV*, double);
    void  sv_setpv(SV*, const char*);
    void  sv_setpvn(SV*, const char*, STRLEN);
    void  sv_setpvf(SV*, const char*, ...);
    void  sv_vsetpvfn(SV*, const char*, STRLEN, va_list *,
                                                    SV **, I32, bool *);
    void  sv_setsv(SV*, SV*);

Notice that you can choose to specify the length of the string to be
assigned by using C<sv_setpvn>, C<newSVpvn>, or C<newSVpv>, or you may
allow Perl to calculate the length by using C<sv_setpv> or by specifying
0 as the second argument to C<newSVpv>.  Be warned, though, that Perl will
determine the string's length by using C<strlen>, which depends on the
string terminating with a NUL character, and not otherwise containing
NULs.

The arguments of C<sv_setpvf> are processed like C<sprintf>, and the
formatted output becomes the value.

C<sv_vsetpvfn> is an analogue of C<vsprintf>, but it allows you to specify
either a pointer to a variable argument list or the address and length of
an array of SVs.  The last argument points to a boolean; on return, if that
boolean is true, then locale-specific information has been used to format
the string, and the string's contents are therefore untrustworthy (see
L<perlsec>).
This pointer may be NULL if that information is not important. Note that this function requires you to specify the length of the format. The C functions are not generic enough to operate on values that have "magic". See L later in this document. All SVs that contain strings should be terminated with a NUL character. If it is not NUL-terminated there is a risk of core dumps and corruptions from code which passes the string to C functions or system calls which expect a NUL-terminated string. Perl's own functions typically add a trailing NUL for this reason. Nevertheless, you should be very careful when you pass a string stored in an SV to a C function or system call. To access the actual value that an SV points to, you can use the macros: SvIV(SV*) SvUV(SV*) SvNV(SV*) SvPV(SV*, STRLEN len) SvPV_nolen(SV*) which will automatically coerce the actual scalar type into an IV, UV, double, or string. In the C macro, the length of the string returned is placed into the variable C (this is a macro, so you do I use C<&len>). If you do not care what the length of the data is, use the C macro. Historically the C macro with the global variable C has been used in this case. But that can be quite inefficient because C must be accessed in thread-local storage in threaded Perl. In any case, remember that Perl allows arbitrary strings of data that may both contain NULs and might not be terminated by a NUL. Also remember that C doesn't allow you to safely say C. It might work with your compiler, but it won't work for everyone. Break this sort of statement up into separate assignments: SV *s; STRLEN len; char *ptr; ptr = SvPV(s, len); foo(ptr, len); If you want to know if the scalar value is TRUE, you can use: SvTRUE(SV*) Although Perl will automatically grow strings for you, if you need to force Perl to allocate more memory for your SV, you can use the macro SvGROW(SV*, STRLEN newlen) which will determine if more memory needs to be allocated. If so, it will call the function C. 
Note that C can only increase, not decrease, the allocated memory of an SV and that it does not automatically add space for the trailing NUL byte (perl's own string functions typically do C). If you have an SV and want to know what kind of data Perl thinks is stored in it, you can use the following macros to check the type of SV you have. SvIOK(SV*) SvNOK(SV*) SvPOK(SV*) You can get and set the current length of the string stored in an SV with the following macros: SvCUR(SV*) SvCUR_set(SV*, I32 val) You can also get a pointer to the end of the string stored in the SV with the macro: SvEND(SV*) But note that these last three macros are valid only if C is true. If you want to append something to the end of string stored in an C, you can use the following functions: void sv_catpv(SV*, const char*); void sv_catpvn(SV*, const char*, STRLEN); void sv_catpvf(SV*, const char*, ...); void sv_vcatpvfn(SV*, const char*, STRLEN, va_list *, SV **, I32, bool); void sv_catsv(SV*, SV*); The first function calculates the length of the string to be appended by using C. In the second, you specify the length of the string yourself. The third function processes its arguments like C and appends the formatted output. The fourth function works like C. You can specify the address and length of an array of SVs instead of the va_list argument. The fifth function extends the string stored in the first SV with the string stored in the second SV. It also forces the second SV to be interpreted as a string. The C functions are not generic enough to operate on values that have "magic". See L later in this document. If you know the name of a scalar variable, you can get a pointer to its SV by using the following: SV* get_sv("package::varname", 0); This returns NULL if the variable does not exist. If you want to know if this variable (or any other SV) is actually C, you can call: SvOK(SV*) The scalar C value is stored in an SV instance called C. Its address can be used whenever an C is needed. 
Make sure that you don't try to compare a random sv with C<&PL_sv_undef>. For example when interfacing Perl code, it'll work correctly for: foo(undef); But won't work when called as: $x = undef; foo($x); So to repeat always use SvOK() to check whether an sv is defined. Also you have to be careful when using C<&PL_sv_undef> as a value in AVs or HVs (see L). There are also the two values C and C, which contain boolean TRUE and FALSE values, respectively. Like C, their addresses can be used whenever an C is needed. Do not be fooled into thinking that C<(SV *) 0> is the same as C<&PL_sv_undef>. Take this code: SV* sv = (SV*) 0; if (I-am-to-return-a-real-value) { sv = sv_2mortal(newSViv(42)); } sv_setsv(ST(0), sv); This code tries to return a new SV (which contains the value 42) if it should return a real value, or undef otherwise. Instead it has returned a NULL pointer which, somewhere down the line, will cause a segmentation violation, bus error, or just weird results. Change the zero to C<&PL_sv_undef> in the first line and all will be well. To free an SV that you've created, call C. Normally this call is not necessary (see L). =head2 Offsets Perl provides the function C to efficiently remove characters from the beginning of a string; you give it an SV and a pointer to somewhere inside the PV, and it discards everything before the pointer. The efficiency comes by means of a little hack: instead of actually removing the characters, C sets the flag C (offset OK) to signal to other functions that the offset hack is in effect, and it puts the number of bytes chopped off into the IV field of the SV. It then moves the PV pointer (called C) forward that many bytes, and adjusts C and C. Hence, at this point, the start of the buffer that we allocated lives at C in memory and the PV pointer is pointing into the middle of this allocated storage. 
This is best demonstrated by example: % ./perl -Ilib -MDevel::Peek -le '$a="12345"; $a=~s/.//; Dump($a)' SV = PVIV(0x8128450) at 0x81340f0 REFCNT = 1 FLAGS = (POK,OOK,pPOK) IV = 1 (OFFSET) PV = 0x8135781 ( "1" . ) "2345"\0 CUR = 4 LEN = 5 Here the number of bytes chopped off (1) is put into IV, and C helpfully reminds us that this is an offset. The portion of the string between the "real" and the "fake" beginnings is shown in parentheses, and the values of C and C reflect the fake beginning, not the real one. Something similar to the offset hack is performed on AVs to enable efficient shifting and splicing off the beginning of the array; while C points to the first element in the array that is visible from Perl, C points to the real start of the C array. These are usually the same, but a C operation can be carried out by increasing C by one and decreasing C and C. Again, the location of the real start of the C array only comes into play when freeing the array. See C in F. =head2 What's Really Stored in an SV? Recall that the usual method of determining the type of scalar you have is to use C macros. Because a scalar can be both a number and a string, usually these macros will always return TRUE and calling the C macros will do the appropriate conversion of string to integer/double or integer/double to string. If you I need to know if you have an integer, double, or string pointer in an SV, you can use the following three macros instead: SvIOKp(SV*) SvNOKp(SV*) SvPOKp(SV*) These will tell you if you truly have an integer, double, or string pointer stored in your SV. The "p" stands for private. There are various ways in which the private and public flags may differ. For example, a tied SV may have a valid underlying value in the IV slot (so SvIOKp is true), but the data should be accessed via the FETCH routine rather than directly, so SvIOK is false. 
Another is when numeric conversion has occurred and precision has been
lost: only the private flag is set on 'lossy' values.  So when an NV is
converted to an IV with loss, SvIOKp, SvNOKp and SvNOK will be set, while
SvIOK won't be.

In general, though, it's best to use the C<Sv*V> macros.

=head2 Working with AVs

There are two ways to create and load an AV.  The first method creates an
empty AV:

    AV*  newAV();

The second method both creates the AV and initially populates it with SVs:

    AV*  av_make(I32 num, SV **ptr);

The second argument points to an array containing C<num> C<SV*>'s.  Once the
AV has been created, the SVs can be destroyed, if so desired.

Once the AV has been created, the following operations are possible on it:

    void  av_push(AV*, SV*);
    SV*   av_pop(AV*);
    SV*   av_shift(AV*);
    void  av_unshift(AV*, I32 num);

These should be familiar operations, with the exception of C<av_unshift>.
This routine adds C<num> elements at the front of the array with the
C<undef> value.  You must then use C<av_store> (described below) to assign
values to these new elements.

Here are some other functions:

    I32   av_len(AV*);
    SV**  av_fetch(AV*, I32 key, I32 lval);
    SV**  av_store(AV*, I32 key, SV* val);

The C<av_len> function returns the highest index value in an array (just
like $#array in Perl).  If the array is empty, -1 is returned.  The
C<av_fetch> function returns the value at index C<key>, but if C<lval>
is non-zero, then C<av_fetch> will store an undef value at that index.
The C<av_store> function stores the value C<val> at index C<key>, and does
not increment the reference count of C<val>.  Thus the caller is responsible
for taking care of that, and if C<av_store> returns NULL, the caller will
have to decrement the reference count to avoid a memory leak.  Note that
C<av_fetch> and C<av_store> both return C<SV**>'s, not C<SV*>'s as their
return value.

A few more:

    void  av_clear(AV*);
    void  av_undef(AV*);
    void  av_extend(AV*, I32 key);

The C<av_clear> function deletes all the elements in the AV* array, but
does not actually delete the array itself.  The C<av_undef> function will
delete all the elements in the array plus the array itself.
The C function extends the array so that it contains at least C elements. If C is less than the currently allocated length of the array, then nothing is done. If you know the name of an array variable, you can get a pointer to its AV by using the following: AV* get_av("package::varname", 0); This returns NULL if the variable does not exist. See L for more information on how to use the array access functions on tied arrays. =head2 Working with HVs To create an HV, you use the following routine: HV* newHV(); Once the HV has been created, the following operations are possible on it: SV** hv_store(HV*, const char* key, U32 klen, SV* val, U32 hash); SV** hv_fetch(HV*, const char* key, U32 klen, I32 lval); The C parameter is the length of the key being passed in (Note that you cannot pass 0 in as a value of C to tell Perl to measure the length of the key). The C argument contains the SV pointer to the scalar being stored, and C is the precomputed hash value (zero if you want C to calculate it for you). The C parameter indicates whether this fetch is actually a part of a store operation, in which case a new undefined value will be added to the HV with the supplied key and C will return as if the value had already existed. Remember that C and C return C's and not just C. To access the scalar value, you must first dereference the return value. However, you should check to make sure that the return value is not NULL before dereferencing it. The first of these two functions checks if a hash table entry exists, and the second deletes it. bool hv_exists(HV*, const char* key, U32 klen); SV* hv_delete(HV*, const char* key, U32 klen, I32 flags); If C does not include the C flag then C will create and return a mortal copy of the deleted value. And more miscellaneous functions: void hv_clear(HV*); void hv_undef(HV*); Like their AV counterparts, C deletes all the entries in the hash table but does not actually delete the hash table. 
The C deletes both the entries and the hash table itself. Perl keeps the actual data in a linked list of structures with a typedef of HE. These contain the actual key and value pointers (plus extra administrative overhead). The key is a string pointer; the value is an C. However, once you have an C, to get the actual key and value, use the routines specified below. I32 hv_iterinit(HV*); /* Prepares starting point to traverse hash table */ HE* hv_iternext(HV*); /* Get the next entry, and return a pointer to a structure that has both the key and value */ char* hv_iterkey(HE* entry, I32* retlen); /* Get the key from an HE structure and also return the length of the key string */ SV* hv_iterval(HV*, HE* entry); /* Return an SV pointer to the value of the HE structure */ SV* hv_iternextsv(HV*, char** key, I32* retlen); /* This convenience routine combines hv_iternext, hv_iterkey, and hv_iterval. The key and retlen arguments are return values for the key and its length. The value is returned in the SV* argument */ If you know the name of a hash variable, you can get a pointer to its HV by using the following: HV* get_hv("package::varname", 0); This returns NULL if the variable does not exist. The hash algorithm is defined in the C macro: hash = 0; while (klen--) hash = (hash * 33) + *key++; hash = hash + (hash >> 5); /* after 5.6 */ The last step was added in version 5.6 to improve distribution of lower bits in the resulting hash value. See L for more information on how to use the hash access functions on tied hashes. 
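The hashing loop shown above can be tried outside the Perl core.  The
following is a minimal stand-alone sketch, not the macro itself: the name
C<sketch_hash> is invented here, C<uint32_t> stands in for Perl's C<U32>,
and each key byte is treated as unsigned, which is an assumption of this
example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-alone version of the hashing loop quoted above.
 * uint32_t arithmetic wraps modulo 2**32, as a 32-bit U32 would. */
uint32_t sketch_hash(const char *key, size_t klen)
{
    uint32_t hash = 0;

    assert(key != NULL || klen == 0);

    while (klen--)
        hash = (hash * 33) + (unsigned char)*key++;

    /* final mixing step, described above as added in version 5.6 */
    hash = hash + (hash >> 5);
    return hash;
}
```

For the one-byte key "a" (byte value 97) the loop yields 97, and the final
step adds 97 >> 5 = 3, giving 100.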
=head2 Hash API Extensions Beginning with version 5.004, the following functions are also supported: HE* hv_fetch_ent (HV* tb, SV* key, I32 lval, U32 hash); HE* hv_store_ent (HV* tb, SV* key, SV* val, U32 hash); bool hv_exists_ent (HV* tb, SV* key, U32 hash); SV* hv_delete_ent (HV* tb, SV* key, I32 flags, U32 hash); SV* hv_iterkeysv (HE* entry); Note that these functions take C keys, which simplifies writing of extension code that deals with hash structures. These functions also allow passing of C keys to C functions without forcing you to stringify the keys (unlike the previous set of functions). They also return and accept whole hash entries (C), making their use more efficient (since the hash number for a particular string doesn't have to be recomputed every time). See L for detailed descriptions. The following macros must always be used to access the contents of hash entries. Note that the arguments to these macros must be simple variables, since they may get evaluated more than once. See L for detailed descriptions of these macros. HePV(HE* he, STRLEN len) HeVAL(HE* he) HeHASH(HE* he) HeSVKEY(HE* he) HeSVKEY_force(HE* he) HeSVKEY_set(HE* he, SV* sv) These two lower level macros are defined, but must only be used when dealing with keys that are not Cs: HeKEY(HE* he) HeKLEN(HE* he) Note that both C and C do not increment the reference count of the stored C, which is the caller's responsibility. If these functions return a NULL value, the caller will usually have to decrement the reference count of C to avoid a memory leak. =head2 AVs, HVs and undefined values Sometimes you have to store undefined values in AVs or HVs. Although this may be a rare case, it can be tricky. That's because you're used to using C<&PL_sv_undef> if you need an undefined SV. For example, intuition tells you that this XS code: AV *av = newAV(); av_store( av, 0, &PL_sv_undef ); is equivalent to this Perl code: my @av; $av[0] = undef; Unfortunately, this isn't true. 
AVs use C<&PL_sv_undef> as a marker for indicating that an array element has not yet been initialized. Thus, C would be true for the above Perl code, but false for the array generated by the XS code. Other problems can occur when storing C<&PL_sv_undef> in HVs: hv_store( hv, "key", 3, &PL_sv_undef, 0 ); This will indeed make the value C, but if you try to modify the value of C, you'll get the following error: Modification of non-creatable hash value attempted In perl 5.8.0, C<&PL_sv_undef> was also used to mark placeholders in restricted hashes. This caused such hash entries not to appear when iterating over the hash or when checking for the keys with the C function. You can run into similar problems when you store C<&PL_sv_yes> or C<&PL_sv_no> into AVs or HVs. Trying to modify such elements will give you the following error: Modification of a read-only value attempted To make a long story short, you can use the special variables C<&PL_sv_undef>, C<&PL_sv_yes> and C<&PL_sv_no> with AVs and HVs, but you have to make sure you know what you're doing. Generally, if you want to store an undefined value in an AV or HV, you should not use C<&PL_sv_undef>, but rather create a new undefined value using the C function, for example: av_store( av, 42, newSV(0) ); hv_store( hv, "foo", 3, newSV(0), 0 ); =head2 References References are a special type of scalar that point to other data types (including other references). To create a reference, use either of the following functions: SV* newRV_inc((SV*) thing); SV* newRV_noinc((SV*) thing); The C argument can be any of an C, C, or C. The functions are identical except that C increments the reference count of the C, while C does not. For historical reasons, C is a synonym for C. Once you have a reference, you can use the following macro to dereference the reference: SvRV(SV*) then call the appropriate routines, casting the returned C to either an C or C, if required. 
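The difference between the incrementing and non-incrementing ways of
creating a reference can be modelled without the Perl headers.  The sketch
below is a toy model only: C<toy_sv>, C<toy_new>, C<toy_ref_inc>,
C<toy_ref_noinc>, C<toy_dec> and C<toy_live> are names invented for this
illustration and are not part of the Perl API.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for a reference-counted value (not a real SV). */
typedef struct toy_sv {
    int            refcnt;
    struct toy_sv *target;   /* non-NULL when this value is a reference */
} toy_sv;

int toy_live = 0;            /* number of toy values not yet freed */

/* New values start life with a reference count of 1. */
toy_sv *toy_new(void)
{
    toy_sv *sv = malloc(sizeof *sv);
    if (sv == NULL)
        abort();
    sv->refcnt = 1;
    sv->target = NULL;
    toy_live++;
    return sv;
}

/* Drop one count; free at zero, releasing the referent as well. */
void toy_dec(toy_sv *sv)
{
    assert(sv != NULL && sv->refcnt > 0);
    if (--sv->refcnt > 0)
        return;
    if (sv->target != NULL)
        toy_dec(sv->target);
    free(sv);
    toy_live--;
}

/* Make a reference that takes over the caller's count on thing. */
toy_sv *toy_ref_noinc(toy_sv *thing)
{
    toy_sv *rv = toy_new();
    rv->target = thing;
    return rv;
}

/* Make a reference and add a count for it; the caller keeps its own. */
toy_sv *toy_ref_inc(toy_sv *thing)
{
    thing->refcnt++;
    return toy_ref_noinc(thing);
}
```

In this model, a value wrapped with C<toy_ref_inc> still has a count of 1
after the reference is released, so it stays allocated until its creator
also calls C<toy_dec>.  A value wrapped with C<toy_ref_noinc> is freed
together with the reference.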
To determine if an SV is a reference, you can use the following macro: SvROK(SV*) To discover what type of value the reference refers to, use the following macro and then check the return value. SvTYPE(SvRV(SV*)) The most useful types that will be returned are: SVt_IV Scalar SVt_NV Scalar SVt_PV Scalar SVt_RV Scalar SVt_PVAV Array SVt_PVHV Hash SVt_PVCV Code SVt_PVGV Glob (possibly a file handle) SVt_PVMG Blessed or Magical Scalar See the F header file for more details. =head2 Blessed References and Class Objects References are also used to support object-oriented programming. In perl's OO lexicon, an object is simply a reference that has been blessed into a package (or class). Once blessed, the programmer may now use the reference to access the various methods in the class. A reference can be blessed into a package with the following function: SV* sv_bless(SV* sv, HV* stash); The C argument must be a reference value. The C argument specifies which class the reference will belong to. See L for information on converting class names into stashes. /* Still under construction */ The following function upgrades rv to reference if not already one. Creates a new SV for rv to point to. If C is non-null, the SV is blessed into the specified class. SV is returned. SV* newSVrv(SV* rv, const char* classname); The following three functions copy integer, unsigned integer or double into an SV whose reference is C. SV is blessed if C is non-null. SV* sv_setref_iv(SV* rv, const char* classname, IV iv); SV* sv_setref_uv(SV* rv, const char* classname, UV uv); SV* sv_setref_nv(SV* rv, const char* classname, NV iv); The following function copies the pointer value (I) into an SV whose reference is rv. SV is blessed if C is non-null. SV* sv_setref_pv(SV* rv, const char* classname, void* pv); The following function copies a string into an SV whose reference is C. Set length to 0 to let Perl calculate the string length. SV is blessed if C is non-null. 
SV* sv_setref_pvn(SV* rv, const char* classname, char* pv, STRLEN length); The following function tests whether the SV is blessed into the specified class. It does not check inheritance relationships. int sv_isa(SV* sv, const char* name); The following function tests whether the SV is a reference to a blessed object. int sv_isobject(SV* sv); The following function tests whether the SV is derived from the specified class. SV can be either a reference to a blessed object or a string containing a class name. This is the function implementing the C functionality. bool sv_derived_from(SV* sv, const char* name); To check if you've got an object derived from a specific class you have to write: if (sv_isobject(sv) && sv_derived_from(sv, class)) { ... } =head2 Creating New Variables To create a new Perl variable with an undef value which can be accessed from your Perl script, use the following routines, depending on the variable type. SV* get_sv("package::varname", GV_ADD); AV* get_av("package::varname", GV_ADD); HV* get_hv("package::varname", GV_ADD); Notice the use of GV_ADD as the second parameter. The new variable can now be set, using the routines appropriate to the data type. There are additional macros whose values may be bitwise OR'ed with the C argument to enable certain extra features. Those bits are: =over =item GV_ADDMULTI Marks the variable as multiply defined, thus preventing the: Name used only once: possible typo warning. =item GV_ADDWARN Issues the warning: Had to create unexpectedly if the variable did not exist before the function was called. =back If you do not specify a package name, the variable is created in the current package. =head2 Reference Counts and Mortality Perl uses a reference count-driven garbage collection mechanism. SVs, AVs, or HVs (xV for short in the following) start their life with a reference count of 1. If the reference count of an xV ever drops to 0, then it will be destroyed and its memory made available for reuse. 
This normally doesn't happen at the Perl level unless a variable is undef'ed or the last variable holding a reference to it is changed or overwritten. At the internal level, however, reference counts can be manipulated with the following macros: int SvREFCNT(SV* sv); SV* SvREFCNT_inc(SV* sv); void SvREFCNT_dec(SV* sv); However, there is one other function which manipulates the reference count of its argument. The C function, you will recall, creates a reference to the specified argument. As a side effect, it increments the argument's reference count. If this is not what you want, use C instead. For example, imagine you want to return a reference from an XSUB function. Inside the XSUB routine, you create an SV which initially has a reference count of one. Then you call C, passing it the just-created SV. This returns the reference as a new SV, but the reference count of the SV you passed to C has been incremented to two. Now you return the reference from the XSUB routine and forget about the SV. But Perl hasn't! Whenever the returned reference is destroyed, the reference count of the original SV is decreased to one and nothing happens. The SV will hang around without any way to access it until Perl itself terminates. This is a memory leak. The correct procedure, then, is to use C instead of C. Then, if and when the last reference is destroyed, the reference count of the SV will go to zero and it will be destroyed, stopping any memory leak. There are some convenience functions available that can help with the destruction of xVs. These functions introduce the concept of "mortality". An xV that is mortal has had its reference count marked to be decremented, but not actually decremented, until "a short time later". Generally the term "short time later" means a single Perl statement, such as a call to an XSUB function. The actual determinant for when mortal xVs have their reference count decremented depends on two macros, SAVETMPS and FREETMPS. 
See L and L for more details on these macros.

"Mortalization" then is at its simplest a deferred C. However, if you mortalize a variable twice, the reference count will later be decremented twice.

"Mortal" SVs are mainly used for SVs that are placed on perl's stack. For example, an SV which is created just to pass a number to a called sub is made mortal to have it cleaned up automatically when it's popped off the stack. Similarly, results returned by XSUBs (which are pushed on the stack) are often made mortal.

To create a mortal variable, use the functions:

    SV*  sv_newmortal()
    SV*  sv_2mortal(SV*)
    SV*  sv_mortalcopy(SV*)

The first call creates a mortal SV (with no value), the second converts an existing SV to a mortal SV (and thus defers a call to C), and the third creates a mortal copy of an existing SV. Because C gives the new SV no value, it must normally be given one via C, C, etc.:

    SV *tmp = sv_newmortal();
    sv_setiv(tmp, an_integer);

As that takes multiple C statements, it is quite common to see this idiom instead:

    SV *tmp = sv_2mortal(newSViv(an_integer));

You should be careful about creating mortal variables. Strange things can happen if you make the same value mortal within multiple contexts, or if you make a variable mortal multiple times. Thinking of "Mortalization" as deferred C should help to minimize such problems. For example, if you are passing an SV which you I<know> has a high enough REFCNT to survive its use on the stack, you need not do any mortalization. If you are not sure, then doing an C and C, or making a C, is safer.

The mortal routines are not just for SVs; AVs and HVs can be made mortal by passing their address (type-cast to C) to the C or C routines.

=head2 Stashes and Globs

A B<stash> is a hash that contains all variables that are defined within a package. Each key of the stash is a symbol name (shared by all the different types of objects that have the same name), and each value in the hash table is a GV (Glob Value).
This GV in turn contains references to the various objects of that name, including (but not limited to) the following:

    Scalar Value
    Array Value
    Hash Value
    I/O Handle
    Format
    Subroutine

There is a single stash called C that holds the items that exist in the C package. To get at the items in other packages, append the string "::" to the package name. The items in the C package are in the stash C in PL_defstash. The items in the C package are in the stash C in C's stash.

To get the stash pointer for a particular package, use the function:

    HV*  gv_stashpv(const char* name, I32 flags)
    HV*  gv_stashsv(SV*, I32 flags)

The first function takes a literal string, the second uses the string stored in the SV. Remember that a stash is just a hash table, so you get back an C. The C flag will create a new package if it is set to GV_ADD.

The name that C wants is the name of the package whose symbol table you want. The default package is called C
. If you have multiply nested packages, pass their names to C, separated by C<::> as in the Perl language itself.

Alternatively, if you have an SV that is a blessed reference, you can find out the stash pointer by using:

    HV*  SvSTASH(SvRV(SV*));

then use the following to get the package name itself:

    char*  HvNAME(HV* stash);

If you need to bless or re-bless an object you can use the following function:

    SV*  sv_bless(SV*, HV* stash)

where the first argument, an C, must be a reference, and the second argument is a stash. The returned C can now be used in the same way as any other SV.

For more information on references and blessings, consult L.

=head2 Double-Typed SVs

Scalar variables normally contain only one type of value, an integer, double, pointer, or reference. Perl will automatically convert the actual scalar data from the stored type into the requested type.

Some scalar variables contain more than one type of scalar data. For example, the variable C<$!> contains either the numeric value of C or its string equivalent from either C or C.

To force multiple data values into an SV, you must do two things: use the C routines to add the additional scalar type, then set a flag so that Perl will believe it contains more than one type of data. The four macros to set the flags are:

    SvIOK_on
    SvNOK_on
    SvPOK_on
    SvROK_on

The particular macro you must use depends on which C routine you called first. This is because every C routine turns on only the bit for the particular type of data being set, and turns off all the rest.

For example, to create a new Perl variable called "dberror" that contains both the numeric and descriptive string error values, you could use the following code:

    extern int  dberror;
    extern char *dberror_list[];   /* array of message strings,
                                    * indexed by error number */

    SV* sv = get_sv("dberror", GV_ADD);
    sv_setiv(sv, (IV) dberror);
    sv_setpv(sv, dberror_list[dberror]);
    SvIOK_on(sv);

If the order of C and C had been reversed, then the macro C would need to be called instead of C.
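The flag behaviour described under Double-Typed SVs can be illustrated with a small stand-alone model. The sketch below is plain C and does not use the Perl API: C<toy_sv>, C<toy_setiv>, C<toy_setpv>, C<toy_iok_on> and the C<TOY_*> flag values are all hypothetical names invented for this illustration. It only assumes what the text states, namely that each "set" routine turns on the bit for its own type and turns off all the rest, so the bit for the first value has to be switched back on by hand.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag bits for this toy model; NOT Perl's real flag values. */
enum { TOY_IOK = 1, TOY_POK = 2 };

/* A toy scalar with one integer slot and one string slot. */
typedef struct {
    unsigned flags;   /* which slots are currently considered valid */
    long iv;          /* integer slot */
    const char *pv;   /* string slot (borrowed, not owned) */
} toy_sv;

/* Store an integer: only the integer flag stays on afterwards. */
void toy_setiv(toy_sv *sv, long iv) {
    assert(sv != NULL);
    sv->iv = iv;
    sv->flags = TOY_IOK;
}

/* Store a string: only the string flag stays on afterwards.
 * The integer slot keeps its old contents but is no longer marked valid. */
void toy_setpv(toy_sv *sv, const char *pv) {
    assert(sv != NULL && pv != NULL);
    sv->pv = pv;
    sv->flags = TOY_POK;
}

/* Re-enable the integer flag without touching the stored values. */
void toy_iok_on(toy_sv *sv) {
    assert(sv != NULL);
    sv->flags |= TOY_IOK;
}

/* Build a scalar that carries both an error number and its message. */
toy_sv toy_make_dual(long code, const char *message) {
    toy_sv sv = { 0, 0, NULL };
    toy_setiv(&sv, code);      /* flags: integer only */
    toy_setpv(&sv, message);   /* flags: string only */
    toy_iok_on(&sv);           /* flags: integer and string */
    return sv;
}
```

If the two "set" calls were swapped, the string flag would be the one that is lost, and the model would need a string counterpart of C<toy_iok_on> instead.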
=head2 Magic Variables [This section still under construction. Ignore everything here. Post no bills. Everything not permitted is forbidden.] Any SV may be magical, that is, it has special features that a normal SV does not have. These features are stored in the SV structure in a linked list of C's, typedef'ed to C. struct magic { MAGIC* mg_moremagic; MGVTBL* mg_virtual; U16 mg_private; char mg_type; U8 mg_flags; I32 mg_len; SV* mg_obj; char* mg_ptr; }; Note this is current as of patchlevel 0, and could change at any time. =head2 Assigning Magic Perl adds magic to an SV using the sv_magic function: void sv_magic(SV* sv, SV* obj, int how, const char* name, I32 namlen); The C argument is a pointer to the SV that is to acquire a new magical feature. If C is not already magical, Perl uses the C macro to convert C to type C. Perl then continues by adding new magic to the beginning of the linked list of magical features. Any prior entry of the same type of magic is deleted. Note that this can be overridden, and multiple instances of the same type of magic can be associated with an SV. The C and C arguments are used to associate a string with the magic, typically the name of a variable. C is stored in the C field and if C is non-null then either a C copy of C or C itself is stored in the C field, depending on whether C is greater than zero or equal to zero respectively. As a special case, if C<(name && namlen == HEf_SVKEY)> then C is assumed to contain an C and is stored as-is with its REFCNT incremented. The sv_magic function uses C to determine which, if any, predefined "Magic Virtual Table" should be assigned to the C field. See the L section below. The C argument is also stored in the C field. The value of C should be chosen from the set of macros C found in F. Note that before these macros were added, Perl internals used to directly use character literals, so you may occasionally come across old code or documentation referring to 'U' magic rather than C for example. 
The C argument is stored in the C field of the C structure. If it is not the same as the C argument, the reference count of the C object is incremented. If it is the same, or if the C argument is C, or if it is a NULL pointer, then C is merely stored, without the reference count being incremented. See also C in L for a more flexible way to add magic to an SV. There is also a function to add magic to an C: void hv_magic(HV *hv, GV *gv, int how); This simply calls C and coerces the C argument into an C. To remove the magic from an SV, call the function sv_unmagic: int sv_unmagic(SV *sv, int type); The C argument should be equal to the C value when the C was initially made magical. However, note that C removes all magic of a certain C from the C. If you want to remove only certain magic of a C based on the magic virtual table, use C instead: int sv_unmagicext(SV *sv, int type, MGVTBL *vtbl); =head2 Magic Virtual Tables The C field in the C structure is a pointer to an C, which is a structure of function pointers and stands for "Magic Virtual Table" to handle the various operations that might be applied to that variable. The C has five (or sometimes eight) pointers to the following routine types: int (*svt_get)(SV* sv, MAGIC* mg); int (*svt_set)(SV* sv, MAGIC* mg); U32 (*svt_len)(SV* sv, MAGIC* mg); int (*svt_clear)(SV* sv, MAGIC* mg); int (*svt_free)(SV* sv, MAGIC* mg); int (*svt_copy)(SV *sv, MAGIC* mg, SV *nsv, const char *name, I32 namlen); int (*svt_dup)(MAGIC *mg, CLONE_PARAMS *param); int (*svt_local)(SV *nsv, MAGIC *mg); This MGVTBL structure is set at compile-time in F and there are currently 32 types. These different structures contain pointers to various routines that perform additional actions depending on which function is being called. Function pointer Action taken ---------------- ------------ svt_get Do something before the value of the SV is retrieved. svt_set Do something after the SV is assigned a value. svt_len Report on the SV's length. 
svt_clear Clear something the SV represents. svt_free Free any extra storage associated with the SV. svt_copy copy tied variable magic to a tied element svt_dup duplicate a magic structure during thread cloning svt_local copy magic to local value during 'local' For instance, the MGVTBL structure called C (which corresponds to an C of C) contains: { magic_get, magic_set, magic_len, 0, 0 } Thus, when an SV is determined to be magical and of type C, if a get operation is being performed, the routine C is called. All the various routines for the various magical types begin with C. NOTE: the magic routines are not considered part of the Perl API, and may not be exported by the Perl library. The last three slots are a recent addition, and for source code compatibility they are only checked for if one of the three flags MGf_COPY, MGf_DUP or MGf_LOCAL is set in mg_flags. This means that most code can continue declaring a vtable as a 5-element value. These three are currently used exclusively by the threading code, and are highly subject to change. The current kinds of Magic Virtual Tables are: =for comment This table is generated by regen/mg_vtable.pl. Any changes made here will be lost. =for mg_vtable.pl begin mg_type (old-style char and macro) MGVTBL Type of magic -------------------------- ------ ------------- \0 PERL_MAGIC_sv vtbl_sv Special scalar variable # PERL_MAGIC_arylen vtbl_arylen Array length ($#ary) % PERL_MAGIC_rhash (none) extra data for restricted hashes . 
PERL_MAGIC_pos vtbl_pos pos() lvalue : PERL_MAGIC_symtab (none) extra data for symbol tables < PERL_MAGIC_backref vtbl_backref for weak ref data @ PERL_MAGIC_arylen_p (none) to move arylen out of XPVAV A PERL_MAGIC_overload vtbl_amagic %OVERLOAD hash a PERL_MAGIC_overload_elem vtbl_amagicelem %OVERLOAD hash element B PERL_MAGIC_bm vtbl_regexp Boyer-Moore (fast string search) c PERL_MAGIC_overload_table vtbl_ovrld Holds overload table (AMT) on stash D PERL_MAGIC_regdata vtbl_regdata Regex match position data (@+ and @- vars) d PERL_MAGIC_regdatum vtbl_regdatum Regex match position data element E PERL_MAGIC_env vtbl_env %ENV hash e PERL_MAGIC_envelem vtbl_envelem %ENV hash element f PERL_MAGIC_fm vtbl_regdata Formline ('compiled' format) G PERL_MAGIC_study vtbl_regexp study()ed string g PERL_MAGIC_regex_global vtbl_mglob m//g target H PERL_MAGIC_hints vtbl_hints %^H hash h PERL_MAGIC_hintselem vtbl_hintselem %^H hash element I PERL_MAGIC_isa vtbl_isa @ISA array i PERL_MAGIC_isaelem vtbl_isaelem @ISA array element k PERL_MAGIC_nkeys vtbl_nkeys scalar(keys()) lvalue L PERL_MAGIC_dbfile (none) Debugger %_ and C magic types are defined specifically for use by extensions and will not be used by perl itself. Extensions can use C magic to 'attach' private information to variables (typically objects). This is especially useful because there is no way for normal perl code to corrupt this private information (unlike using extra elements of a hash object). Similarly, C magic can be used much like tie() to call a C function any time a scalar's value is used or changed. The C's C field points to a C structure: struct ufuncs { I32 (*uf_val)(pTHX_ IV, SV*); I32 (*uf_set)(pTHX_ IV, SV*); IV uf_index; }; When the SV is read from or written to, the C or C function will be called with C as the first arg and a pointer to the SV as the second. A simple example of how to add C magic is shown below. 
Note that the ufuncs structure is copied by sv_magic, so you can safely allocate it on the stack. void Umagic(sv) SV *sv; PREINIT: struct ufuncs uf; CODE: uf.uf_val = &my_get_fn; uf.uf_set = &my_set_fn; uf.uf_index = 0; sv_magic(sv, 0, PERL_MAGIC_uvar, (char*)&uf, sizeof(uf)); Attaching C to arrays is permissible but has no effect. For hashes there is a specialized hook that gives control over hash keys (but not values). This hook calls C 'get' magic if the "set" function in the C structure is NULL. The hook is activated whenever the hash is accessed with a key specified as an C through the functions C, C, C, and C. Accessing the key as a string through the functions without the C<..._ent> suffix circumvents the hook. See L for a detailed description. Note that because multiple extensions may be using C or C magic, it is important for extensions to take extra care to avoid conflict. Typically only using the magic on objects blessed into the same class as the extension is sufficient. For C magic, it is usually a good idea to define an C, even if all its fields will be C<0>, so that individual C pointers can be identified as a particular kind of magic using their magic virtual table. C provides an easy way to do that: STATIC MGVTBL my_vtbl = { 0, 0, 0, 0, 0, 0, 0, 0 }; MAGIC *mg; if ((mg = mg_findext(sv, PERL_MAGIC_ext, &my_vtbl))) { /* this is really ours, not another module's PERL_MAGIC_ext */ my_priv_data_t *priv = (my_priv_data_t *)mg->mg_ptr; ... } Also note that the C and C functions described earlier do B invoke 'set' magic on their targets. This must be done by the user either by calling the C macro after calling these functions, or by using one of the C or C functions. Similarly, generic C code must call the C macro to invoke any 'get' magic if they use an SV obtained from external sources in functions that don't handle magic. See L for a description of these functions. 
For example, calls to the C functions typically need to be followed by C, but they don't need a prior C since their implementation handles 'get' magic. =head2 Finding Magic MAGIC *mg_find(SV *sv, int type); /* Finds the magic pointer of that * type */ This routine returns a pointer to a C structure stored in the SV. If the SV does not have that magical feature, C is returned. If the SV has multiple instances of that magical feature, the first one will be returned. C can be used to find a C structure of an SV based on both its magic type and its magic virtual table: MAGIC *mg_findext(SV *sv, int type, MGVTBL *vtbl); Also, if the SV passed to C or C is not of type SVt_PVMG, Perl may core dump. int mg_copy(SV* sv, SV* nsv, const char* key, STRLEN klen); This routine checks to see what types of magic C has. If the mg_type field is an uppercase letter, then the mg_obj is copied to C, but the mg_type field is changed to be the lowercase letter. =head2 Understanding the Magic of Tied Hashes and Arrays Tied hashes and arrays are magical beasts of the C magic type. WARNING: As of the 5.004 release, proper usage of the array and hash access functions requires understanding a few caveats. Some of these caveats are actually considered bugs in the API, to be fixed in later releases, and are bracketed with [MAYCHANGE] below. If you find yourself actually applying such information in this section, be aware that the behavior may change in the future, umm, without warning. The perl tie function associates a variable with an object that implements the various GET, SET, etc methods. To perform the equivalent of the perl tie function from an XSUB, you must mimic this behaviour. The code below carries out the necessary steps - firstly it creates a new hash, and then creates a second hash which it blesses into the class which will implement the tie methods. Lastly it ties the two hashes together, and returns a reference to the new tied hash. 
Note that the code below does NOT call the TIEHASH method in the MyTie class - see L for details on how to do this. SV* mytie() PREINIT: HV *hash; HV *stash; SV *tie; CODE: hash = newHV(); tie = newRV_noinc((SV*)newHV()); stash = gv_stashpv("MyTie", GV_ADD); sv_bless(tie, stash); hv_magic(hash, (GV*)tie, PERL_MAGIC_tied); RETVAL = newRV_noinc(hash); OUTPUT: RETVAL The C function, when given a tied array argument, merely copies the magic of the array onto the value to be "stored", using C. It may also return NULL, indicating that the value did not actually need to be stored in the array. [MAYCHANGE] After a call to C on a tied array, the caller will usually need to call C to actually invoke the perl level "STORE" method on the TIEARRAY object. If C did return NULL, a call to C will also be usually necessary to avoid a memory leak. [/MAYCHANGE] The previous paragraph is applicable verbatim to tied hash access using the C and C functions as well. C and the corresponding hash functions C and C actually return an undefined mortal value whose magic has been initialized using C. Note the value so returned does not need to be deallocated, as it is already mortal. [MAYCHANGE] But you will need to call C on the returned value in order to actually invoke the perl level "FETCH" method on the underlying TIE object. Similarly, you may also call C on the return value after possibly assigning a suitable value to it using C, which will invoke the "STORE" method on the TIE object. [/MAYCHANGE] [MAYCHANGE] In other words, the array or hash fetch/store functions don't really fetch and store actual values in the case of tied arrays and hashes. They merely call C to attach magic to the values that were meant to be "stored" or "fetched". Later calls to C and C actually do the job of invoking the TIE methods on the underlying objects. Thus the magic mechanism currently implements a kind of lazy access to arrays and hashes. 
Currently (as of perl version 5.004), use of the hash and array access functions requires the user to be aware of whether they are operating on "normal" hashes and arrays, or on their tied variants. The API may be changed to provide more transparent access to both tied and normal data types in future versions. [/MAYCHANGE] You would do well to understand that the TIEARRAY and TIEHASH interfaces are mere sugar to invoke some perl method calls while using the uniform hash and array syntax. The use of this sugar imposes some overhead (typically about two to four extra opcodes per FETCH/STORE operation, in addition to the creation of all the mortal variables required to invoke the methods). This overhead will be comparatively small if the TIE methods are themselves substantial, but if they are only a few statements long, the overhead will not be insignificant. =head2 Localizing changes Perl has a very handy construction { local $var = 2; ... } This construction is I equivalent to { my $oldvar = $var; $var = 2; ... $var = $oldvar; } The biggest difference is that the first construction would reinstate the initial value of $var, irrespective of how control exits the block: C, C, C/C, etc. It is a little bit more efficient as well. There is a way to achieve a similar task from C via Perl API: create a I, and arrange for some changes to be automatically undone at the end of it, either explicit, or via a non-local exit (via die()). A I-like construct is created by a pair of C/C macros (see L). Such a construct may be created specially for some important localized task, or an existing one (like boundaries of enclosing Perl subroutine/block, or an existing pair for freeing TMPs) may be used. (In the second case the overhead of additional localization must be almost negligible.) Note that any XSUB is automatically enclosed in an C/C pair. 
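The idea of a scope whose changes are undone automatically can be sketched with a tiny save stack. This is a stand-alone model in plain C, not Perl's implementation: C<toy_savestack>, C<toy_enter>, C<toy_save_int> and C<toy_leave> are hypothetical names, and the fixed capacity is an arbitrary assumption. The model covers only the explicit exit from a scope; the unwinding that Perl performs on a non-local exit is not modelled here.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SAVE_MAX 64   /* arbitrary capacity for this sketch */

/* One remembered value: where it lives and what it used to be. */
typedef struct {
    int *addr;
    int  old;
} toy_save_entry;

typedef struct {
    toy_save_entry entries[TOY_SAVE_MAX];
    size_t top;                   /* number of entries currently saved */
    size_t marks[TOY_SAVE_MAX];   /* value of top when each scope opened */
    size_t depth;                 /* number of open scopes */
} toy_savestack;

void toy_init(toy_savestack *ss) {
    ss->top = 0;
    ss->depth = 0;
}

/* Open a scope: remember how many entries existed at this point. */
void toy_enter(toy_savestack *ss) {
    assert(ss->depth < TOY_SAVE_MAX);
    ss->marks[ss->depth++] = ss->top;
}

/* Arrange for *var to get its current value back when the scope closes. */
void toy_save_int(toy_savestack *ss, int *var) {
    assert(ss->top < TOY_SAVE_MAX);
    ss->entries[ss->top].addr = var;
    ss->entries[ss->top].old = *var;
    ss->top++;
}

/* Close the innermost scope: undo its saves in reverse order. */
void toy_leave(toy_savestack *ss) {
    assert(ss->depth > 0);
    size_t mark = ss->marks[--ss->depth];
    while (ss->top > mark) {
        ss->top--;
        *ss->entries[ss->top].addr = ss->entries[ss->top].old;
    }
}
```

Restoring in reverse order matters when the same variable is saved more than once inside one scope: the oldest saved value is written last, so it is the one that survives.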
Inside such a I the following service is available:

=over 4

=item C

=item C

=item C

=item C

These macros arrange things to restore the value of integer variable C at the end of enclosing I.

=item C

=item C

These macros arrange things to restore the value of pointers C and C. C must be a pointer of a type which survives conversion to C and back, C should be able to survive conversion to C and back.

=item C

The refcount of C would be decremented at the end of I. This is similar to C in that it is also a mechanism for doing a delayed C. However, while C extends the lifetime of C until the beginning of the next statement, C extends it until the end of the enclosing scope. These lifetimes can be wildly different.

Also compare C.

=item C

Just like C, but mortalizes C at the end of the current scope instead of decrementing its reference count. This usually has the effect of keeping C alive until the statement that called the currently live scope has finished executing.

=item C

The C is op_free()ed at the end of I.

=item C

The chunk of memory which is pointed to by C is Safefree()ed at the end of I.

=item C

Clears a slot in the current scratchpad which corresponds to C at the end of I.

=item C

The key C of C is deleted at the end of I. The string pointed to by C is Safefree()ed. If one has a I in short-lived storage, the corresponding string may be reallocated like this:

    SAVEDELETE(PL_defstash, savepv(tmpbuf), strlen(tmpbuf));

=item C

At the end of I the function C is called with the only argument C.

=item C

At the end of I the function C is called with the implicit context argument (if any), and C
. =item C The current offset on the Perl internal stack (cf. C) is restored at the end of I. =back The following API list contains functions, thus one needs to provide pointers to the modifiable data explicitly (either C pointers, or Perlish Cs). Where the above macros take C, a similar function takes C. =over 4 =item C Equivalent to Perl code C. =item C =item C Similar to C, but localize C<@gv> and C<%gv>. =item C Duplicates the current value of C, on the exit from the current C/C I will restore the value of C using the stored value. It doesn't handle magic. Use C if magic is affected. =item C A variant of C which takes multiple arguments via an array C of C of length C. =item C Similar to C, but will reinstate an C. =item C =item C Similar to C, but localize C and C. =back The C module implements localization of the basic types within the I. People who are interested in how to localize things in the containing scope should take a look there too. =head1 Subroutines =head2 XSUBs and the Argument Stack The XSUB mechanism is a simple way for Perl programs to access C subroutines. An XSUB routine will have a stack that contains the arguments from the Perl program, and a way to map from the Perl data structures to a C equivalent. The stack arguments are accessible through the C macro, which returns the C'th stack argument. Argument 0 is the first argument passed in the Perl subroutine call. These arguments are C, and can be used anywhere an C is used. Most of the time, output from the C routine can be handled through use of the RETVAL and OUTPUT directives. However, there are some cases where the argument stack is not already long enough to handle all the return values. An example is the POSIX tzname() call, which takes no arguments, but returns two, the local time zone's standard and summer time abbreviations. 
To handle this situation, the PPCODE directive is used and the stack is extended using the macro:

    EXTEND(SP, num);

where C is the macro that represents the local copy of the stack pointer, and C is the number of elements the stack should be extended by.

Now that there is room on the stack, values can be pushed on it using the C macro. The pushed values will often need to be "mortal" (see L):

    PUSHs(sv_2mortal(newSViv(an_integer)))
    PUSHs(sv_2mortal(newSVuv(an_unsigned_integer)))
    PUSHs(sv_2mortal(newSVnv(a_double)))
    PUSHs(sv_2mortal(newSVpv("Some String",0)))
    /* Although the last example is better written as the more
     * efficient: */
    PUSHs(newSVpvs_flags("Some String", SVs_TEMP))

Now, in the Perl program calling C, the two values will be assigned as in:

    ($standard_abbrev, $summer_abbrev) = POSIX::tzname;

An alternative (and possibly simpler) way to push values on the stack is to use the macro:

    XPUSHs(SV*)

This macro automatically adjusts the stack for you, if needed. Thus, you do not need to call C to extend the stack.

Despite suggestions in earlier versions of this document, the macros C<(X)PUSH[iunp]> are I<not> suited to XSUBs which return multiple results. For that, either stick to the C<(X)PUSHs> macros shown above, or use the new C macros instead; see L.

For more information, consult L and L.

=head2 Autoloading with XSUBs

If an AUTOLOAD routine is an XSUB, as with Perl subroutines, Perl puts the fully-qualified name of the autoloaded subroutine in the $AUTOLOAD variable of the XSUB's package.

But it also puts the same information in certain fields of the XSUB itself:

    HV *stash           = CvSTASH(cv);
    const char *subname = SvPVX(cv);
    STRLEN name_length  = SvCUR(cv); /* in bytes */
    U32 is_utf8         = SvUTF8(cv);

C contains just the sub name itself, not including the package. For an AUTOLOAD routine in UNIVERSAL or one of its superclasses, C returns NULL during a method call on a nonexistent package.
B: Setting $AUTOLOAD stopped working in 5.6.1, which did not support XS AUTOLOAD subs at all. Perl 5.8.0 introduced the use of fields in the XSUB itself. Perl 5.16.0 restored the setting of $AUTOLOAD. If you need to support 5.8-5.14, use the XSUB's fields. =head2 Calling Perl Routines from within C Programs There are four routines that can be used to call a Perl subroutine from within a C program. These four are: I32 call_sv(SV*, I32); I32 call_pv(const char*, I32); I32 call_method(const char*, I32); I32 call_argv(const char*, I32, register char**); The routine most often used is C. The C argument contains either the name of the Perl subroutine to be called, or a reference to the subroutine. The second argument consists of flags that control the context in which the subroutine is called, whether or not the subroutine is being passed arguments, how errors should be trapped, and how to treat return values. All four routines return the number of arguments that the subroutine returned on the Perl stack. These routines used to be called C, etc., before Perl v5.6.0, but those names are now deprecated; macros of the same name are provided for compatibility. When using any of these routines (except C), the programmer must manipulate the Perl stack. These include the following macros and functions: dSP SP PUSHMARK() PUTBACK SPAGAIN ENTER SAVETMPS FREETMPS LEAVE XPUSH*() POP*() For a detailed description of calling conventions from C to Perl, consult L. =head2 Memory Allocation =head3 Allocation All memory meant to be used with the Perl API functions should be manipulated using the macros described in this section. The macros provide the necessary transparency between differences in the actual malloc implementation that is used within perl. It is suggested that you enable the version of malloc that is distributed with Perl. It keeps pools of various sizes of unallocated memory in order to satisfy allocation requests more quickly. 
However, on some platforms, it may cause spurious malloc or free errors. The following three macros are used to initially allocate memory : Newx(pointer, number, type); Newxc(pointer, number, type, cast); Newxz(pointer, number, type); The first argument C should be the name of a variable that will point to the newly allocated memory. The second and third arguments C and C specify how many of the specified type of data structure should be allocated. The argument C is passed to C. The final argument to C, C, should be used if the C argument is different from the C argument. Unlike the C and C macros, the C macro calls C to zero out all the newly allocated memory. =head3 Reallocation Renew(pointer, number, type); Renewc(pointer, number, type, cast); Safefree(pointer) These three macros are used to change a memory buffer size or to free a piece of memory no longer needed. The arguments to C and C match those of C and C with the exception of not needing the "magic cookie" argument. =head3 Moving Move(source, dest, number, type); Copy(source, dest, number, type); Zero(dest, number, type); These three macros are used to move, copy, or zero out previously allocated memory. The C and C arguments point to the source and destination starting points. Perl will move, copy, or zero out C instances of the size of the C data structure (using the C function). =head2 PerlIO The most recent development releases of Perl have been experimenting with removing Perl's dependency on the "normal" standard I/O suite and allowing other stdio implementations to be used. This involves creating a new abstraction layer that then calls whichever implementation of stdio Perl was compiled with. All XSUBs should now use the functions in the PerlIO abstraction layer and not make any assumptions about what kind of stdio is being used. For a complete description of the PerlIO abstraction, consult L. 
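Returning to the allocation, reallocation and moving macros described under Memory Allocation above: their common shape is "pointer, element count, element type", with the byte size derived from the count and the type. The stand-alone sketch below mimics that shape on top of the standard C library so the argument order can be tried without a Perl build. The C<TOY_*> macros and C<toy_demo> are hypothetical names, and the mapping onto C<malloc>, C<calloc>, C<realloc>, C<memmove>, C<memcpy> and C<memset> is an assumption of this sketch, not a statement about how Perl implements its own macros.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins that copy the argument order of the macros
 * discussed above: (pointer, number, type) and (source, dest, number, type). */
#define TOY_NEWX(ptr, n, type)   ((ptr) = (type *)malloc((n) * sizeof(type)))
#define TOY_NEWXZ(ptr, n, type)  ((ptr) = (type *)calloc((n), sizeof(type)))
#define TOY_RENEW(ptr, n, type)  ((ptr) = (type *)realloc((ptr), (n) * sizeof(type)))
#define TOY_SAFEFREE(ptr)        free(ptr)

#define TOY_MOVE(src, dst, n, type)  memmove((dst), (src), (n) * sizeof(type))
#define TOY_COPY(src, dst, n, type)  memcpy((dst), (src), (n) * sizeof(type))
#define TOY_ZERO(dst, n, type)       memset((dst), 0, (n) * sizeof(type))

/* Allocate four ints, grow the buffer to six, shift the contents right by
 * two places and zero the gap. Returns the buffer (caller frees it) or
 * NULL if an allocation fails. */
int *toy_demo(void) {
    int *buf;
    TOY_NEWXZ(buf, 4, int);              /* four zeroed elements */
    if (buf == NULL)
        return NULL;
    for (int i = 0; i < 4; i++)
        buf[i] = i + 1;                  /* 1 2 3 4 */

    int *grown = buf;
    TOY_RENEW(grown, 6, int);            /* count is in elements, not bytes */
    if (grown == NULL) {
        TOY_SAFEFREE(buf);               /* old block is still valid on failure */
        return NULL;
    }
    buf = grown;

    TOY_MOVE(buf, buf + 2, 4, int);      /* regions overlap, so use the move form */
    TOY_ZERO(buf, 2, int);               /* 0 0 1 2 3 4 */
    return buf;
}
```

The overlapping shift uses the move form because C<memmove> is defined for overlapping regions while C<memcpy> is not; C<TOY_COPY> is listed only to show the matching argument order.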
=head2 Putting a C value on Perl stack A lot of opcodes (this is an elementary operation in the internal perl stack machine) put an SV* on the stack. However, as an optimization the corresponding SV is (usually) not recreated each time. The opcodes reuse specially assigned SVs (Is) which are (as a corollary) not constantly freed/created. Each of the targets is created only once (but see L below), and when an opcode needs to put an integer, a double, or a string on stack, it just sets the corresponding parts of its I and puts the I on stack. The macro to put this target on stack is C, and it is directly used in some opcodes, as well as indirectly in zillions of others, which use it via C<(X)PUSH[iunp]>. Because the target is reused, you must be careful when pushing multiple values on the stack. The following code will not do what you think: XPUSHi(10); XPUSHi(20); This translates as "set C to 10, push a pointer to C onto the stack; set C to 20, push a pointer to C onto the stack". At the end of the operation, the stack does not contain the values 10 and 20, but actually contains two pointers to C, which we have set to 20. If you need to push multiple different values then you should either use the C<(X)PUSHs> macros, or else use the new C macros, none of which make use of C. The C<(X)PUSHs> macros simply push an SV* on the stack, which, as noted under L, will often need to be "mortal". The new C macros make this a little easier to achieve by creating a new mortal for you (via C<(X)PUSHmortal>), pushing that onto the stack (extending it if necessary in the case of the C macros), and then setting its value. Thus, instead of writing this to "fix" the example above: XPUSHs(sv_2mortal(newSViv(10))) XPUSHs(sv_2mortal(newSViv(20))) you can simply write: mXPUSHi(10) mXPUSHi(20) On a related note, if you do use C<(X)PUSH[iunp]>, then you're going to need a C in your variable declarations so that the C<*PUSH*> macros can make use of the local variable C. See also C and C. 
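The pitfall with a reused target comes from the stack holding pointers rather than values. The following stand-alone model in plain C makes that visible; it is not the Perl API. C<toy_stack>, C<toy_push_targ>, C<toy_push_fresh> and C<toy_stack_clear> are hypothetical names, with C<toy_push_targ> standing in for the pattern "set the shared target, push a pointer to it" and C<toy_push_fresh> for "create a new value, push a pointer to that". Automatic cleanup of mortal values is not modelled, so the sketch frees its fresh values explicitly.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_STACK_MAX 16   /* arbitrary capacity for this sketch */

typedef struct { long iv; } toy_sv;

typedef struct {
    toy_sv *slots[TOY_STACK_MAX];   /* the stack stores pointers, not values */
    int     top;
    toy_sv  targ;                   /* the single reusable target */
} toy_stack;

void toy_stack_init(toy_stack *st) {
    st->top = 0;
    st->targ.iv = 0;
}

/* Set the shared target and push its address. */
void toy_push_targ(toy_stack *st, long iv) {
    assert(st->top < TOY_STACK_MAX);
    st->targ.iv = iv;
    st->slots[st->top++] = &st->targ;
}

/* Give the value its own storage and push that address. */
void toy_push_fresh(toy_stack *st, long iv) {
    assert(st->top < TOY_STACK_MAX);
    toy_sv *sv = malloc(sizeof *sv);
    assert(sv != NULL);
    sv->iv = iv;
    st->slots[st->top++] = sv;
}

/* Pop everything, freeing only the values that were allocated separately. */
void toy_stack_clear(toy_stack *st) {
    while (st->top > 0) {
        toy_sv *sv = st->slots[--st->top];
        if (sv != &st->targ)
            free(sv);
    }
}
```

Pushing 10 and then 20 through C<toy_push_targ> leaves two identical pointers on the stack, both seeing 20; the same two pushes through C<toy_push_fresh> leave two distinct values.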

=head2 Scratchpads

The question remains on when the SVs which are I<target>s for opcodes
are created.  The answer is that they are created when the current
unit--a subroutine or a file (for opcodes for statements outside of
subroutines)--is compiled.  During this time a special anonymous Perl
array is created, which is called a scratchpad for the current unit.

A scratchpad keeps SVs which are lexicals for the current unit and are
targets for opcodes.  One can deduce that an SV lives on a scratchpad
by looking on its flags: lexicals have C set, and I<target>s have C set.

The correspondence between OPs and I<target>s is not 1-to-1.  Different
OPs in the compile tree of the unit can use the same target, if this
would not conflict with the expected life of the temporary.

=head2 Scratchpads and recursion

In fact it is not 100% true that a compiled unit contains a pointer to
the scratchpad AV.  In fact it contains a pointer to an AV of
(initially) one element, and this element is the scratchpad AV.  Why do
we need an extra level of indirection?

The answer is B, and maybe B.  Both these can create several execution
pointers going into the same subroutine.  For the subroutine-child not
to write over the temporaries for the subroutine-parent (lifespan of
which covers the call to the child), the parent and the child should
have different scratchpads.  (I the lexicals should be separate anyway!)

So each subroutine is born with an array of scratchpads (of length 1).
On each entry to the subroutine it is checked that the current depth of
the recursion is not more than the length of this array, and if it is,
a new scratchpad is created and pushed into the array.

The I<target>s on this scratchpad are Cs, but they are already marked
with correct flags.

=head1 Compiled code

=head2 Code tree

Here we describe the internal form your code is converted to by Perl.
Start with a simple example:

    $a = $b + $c;

This is converted to a tree similar to this one:

             assign-to
           /           \
          +             $a
        /   \
      $b     $c

(but slightly more complicated).
This tree reflects the way Perl parsed your code, but has nothing to do
with the execution order.  There is an additional "thread" going through
the nodes of the tree which shows the order of execution of the nodes.
In our simplified example above it looks like:

    $b ---> $c ---> + ---> $a ---> assign-to

But with the actual compile tree for C<$a = $b + $c> it is different:
some nodes I<optimized away>.  As a corollary, though the actual tree
contains more nodes than our simplified example, the execution order is
the same as in our example.

=head2 Examining the tree

If you have your perl compiled for debugging (usually done with
C<-DDEBUGGING> on the C command line), you may examine the compiled
tree by specifying C<-Dx> on the Perl command line.  The output takes
several lines per node, and for C<$b+$c> it looks like this:

    5           TYPE = add  ===> 6
                TARG = 1
                FLAGS = (SCALAR,KIDS)
                {
                    TYPE = null  ===> (4)
                      (was rv2sv)
                    FLAGS = (SCALAR,KIDS)
                    {
    3                   TYPE = gvsv  ===> 4
                        FLAGS = (SCALAR)
                        GV = main::b
                    }
                }
                {
                    TYPE = null  ===> (5)
                      (was rv2sv)
                    FLAGS = (SCALAR,KIDS)
                    {
    4                   TYPE = gvsv  ===> 5
                        FLAGS = (SCALAR)
                        GV = main::c
                    }
                }

This tree has 5 nodes (one per C specifier), only 3 of them are not
optimized away (one per number in the left column).  The immediate
children of the given node correspond to C<{}> pairs on the same level
of indentation, thus this listing corresponds to the tree:

                   add
                 /     \
               null    null
                |       |
               gvsv    gvsv

The execution order is indicated by C<===E<gt>> marks, thus it is
C<3 4 5 6> (node C<6> is not included into above listing), i.e., C.

Each of these nodes represents an op, a fundamental operation inside the
Perl core.  The code which implements each operation can be found in the
F files; the function which implements the op with type C is C, and so
on.  As the tree above shows, different ops have different numbers of
children: C is a binary operator, as one would expect, and so has two
children.
To accommodate the various different numbers of children, there are
various types of op data structure, and they link together in different
ways.

The simplest type of op structure is C: this has no children.  Unary
operators, Cs, have one child, and this is pointed to by the C field.
Binary operators (Cs) have not only an C field but also an C field.
The most complex type of op is a C, which has any number of children.
In this case, the first child is pointed to by C and the last child by
C.  The children in between can be found by iteratively following the C
pointer from the first child to the last.

There are also two other op types: a C holds a regular expression, and
has no children, and a C may or may not have children.  If the C field
is non-zero, it behaves like a C.  To complicate matters, if a C is
actually a C op after optimization (see L) it will still have children
in accordance with its former type.

Another way to examine the tree is to use a compiler back-end module,
such as L.

=head2 Compile pass 1: check routines

The tree is created by the compiler while I code feeds it the
constructions it recognizes.  Since I works bottom-up, so does the first
pass of perl compilation.

What makes this pass interesting for perl developers is that some
optimization may be performed on this pass.  This is optimization by
so-called "check routines".  The correspondence between node names and
corresponding check routines is described in F (do not forget to run C
if you modify this file).

A check routine is called when the node is fully constructed except for
the execution-order thread.  Since at this time there are no back-links
to the currently constructed node, one can do most any operation to the
top-level node, including freeing it and/or creating new nodes
above/below it.

The check routine returns the node which should be inserted into the
tree (if the top-level node was not modified, check routine returns its
argument).

By convention, check routines have names C.
They are usually called from C subroutines (or C) (which in turn are
called from F).

=head2 Compile pass 1a: constant folding

Immediately after the check routine is called the returned node is
checked for being compile-time executable.  If it is (the value is
judged to be constant) it is immediately executed, and a I node with the
"return value" of the corresponding subtree is substituted instead.  The
subtree is deleted.

If constant folding was not performed, the execution-order thread is
created.

=head2 Compile pass 2: context propagation

When a context for a part of compile tree is known, it is propagated
down through the tree.  At this time the context can have 5 values
(instead of 2 for runtime context): void, boolean, scalar, list, and
lvalue.  In contrast with the pass 1 this pass is processed from top to
bottom: a node's context determines the context for its children.

Additional context-dependent optimizations are performed at this time.
Since at this moment the compile tree contains back-references (via
"thread" pointers), nodes cannot be free()d now.  To allow
optimized-away nodes at this stage, such nodes are null()ified instead
of free()ing (i.e. their type is changed to OP_NULL).

=head2 Compile pass 3: peephole optimization

After the compile tree for a subroutine (or for an C or a file) is
created, an additional pass over the code is performed.  This pass is
neither top-down nor bottom-up, but in the execution order (with
additional complications for conditionals).  Optimizations performed at
this stage are subject to the same restrictions as in the pass 2.

Peephole optimizations are done by calling the function pointed to by
the global variable C.  By default, C just calls the function pointed
to by the global variable C.  By default, that performs some basic op
fixups and optimisations along the execution-order op chain, and
recursively calls C for each side chain of ops (resulting from
conditionals).
Extensions may provide additional optimisations or fixups, hooking into
either the per-subroutine or recursive stage, like this:

    static peep_t prev_peepp;
    static void my_peep(pTHX_ OP *o)
    {
        /* custom per-subroutine optimisation goes here */
        prev_peepp(aTHX_ o);
        /* custom per-subroutine optimisation may also go here */
    }
    BOOT:
        prev_peepp = PL_peepp;
        PL_peepp = my_peep;

    static peep_t prev_rpeepp;
    static void my_rpeep(pTHX_ OP *o)
    {
        OP *orig_o = o;
        for(; o; o = o->op_next) {
            /* custom per-op optimisation goes here */
        }
        prev_rpeepp(aTHX_ orig_o);
    }
    BOOT:
        prev_rpeepp = PL_rpeepp;
        PL_rpeepp = my_rpeep;

=head2 Pluggable runops

The compile tree is executed in a runops function.  There are two runops
functions, in F and in F.  C is used with DEBUGGING and C is used
otherwise.  For fine control over the execution of the compile tree it
is possible to provide your own runops function.

It's probably best to copy one of the existing runops functions and
change it to suit your needs.  Then, in the BOOT section of your XS
file, add the line:

    PL_runops = my_runops;

This function should be as efficient as possible to keep your programs
running as fast as possible.

=head2 Compile-time scope hooks

As of perl 5.14 it is possible to hook into the compile-time lexical
scope mechanism using C.  This is used like this:

    STATIC void my_start_hook(pTHX_ int full);
    STATIC BHK my_hooks;

    BOOT:
        BhkENTRY_set(&my_hooks, bhk_start, my_start_hook);
        Perl_blockhook_register(aTHX_ &my_hooks);

This will arrange to have C called at the start of compiling every
lexical scope.  The available hooks are:

=over 4

=item C

This is called just after starting a new lexical scope.  Note that Perl
code like

    if ($x) { ... }

creates two scopes: the first starts at the C<(> and has C, the second
starts at the C<{> and has C.  Both end at the C<}>, so calls to C and
C will match.  Anything
pushed onto the save stack by this hook will be popped just before the
scope ends (between the C and C hooks, in fact).

=item C

This is called at the end of a lexical scope, just before unwinding the
stack. I is the root of the optree representing the scope; it is a
double pointer so you can replace the OP if you need to.

=item C

This is called at the end of a lexical scope, just after unwinding the
stack. I is as above. Note that it is possible for calls to C
and C to nest, if there is something on the save stack that
calls string eval.

=item C

This is called just before starting to compile an C, C, C or C, after the eval has been set up. I is the
OP that requested the eval, and will normally be an C,
C or C.

=back

Once you have your hook functions, you need a C structure to put
them in. It's best to allocate it statically, since there is no way to
free it once it's registered. The function pointers should be inserted
into this structure using the C macro, which will also set
flags indicating which entries are valid. If you do need to allocate
your C dynamically for some reason, be sure to zero it before you
start.

Once registered, there is no mechanism to switch these hooks off, so if
that is necessary you will need to do this yourself. An entry in C<%^H>
is probably the best way, so the effect is lexically scoped; however it
is also possible to use the C and C macros to
temporarily switch entries on and off. You should also be aware that
generally speaking at least one scope will have opened before your
extension is loaded, so you will see some C pairs that
didn't have a matching C.
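
The bookkeeping described above (a statically allocated table of hook
pointers, a setter that also records which entries are valid, and
switches that turn an entry off and on again) can be sketched in plain
C.  This is a model only and does not use the Perl API: C<ModelHooks>,
C<model_hooks_set>, C<model_hooks_disable>, C<model_hooks_enable> and
the two hook identifiers are hypothetical names, and the real structure
may be laid out quite differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hook slots; the real set of hooks is listed above. */
enum model_hook_id { MODEL_HOOK_START, MODEL_HOOK_END, MODEL_HOOK_COUNT };

typedef void (*model_hook_fn)(int *counter);

typedef struct {
    unsigned      valid;                   /* bit i set: entry i may run */
    model_hook_fn entry[MODEL_HOOK_COUNT];
} ModelHooks;

/* Store the pointer and mark the entry valid in one step. */
void model_hooks_set(ModelHooks *h, enum model_hook_id id, model_hook_fn fn)
{
    assert(fn != NULL);
    h->entry[id] = fn;
    h->valid |= 1u << id;
}

/* Temporarily switch an entry off without forgetting the pointer. */
void model_hooks_disable(ModelHooks *h, enum model_hook_id id)
{
    h->valid &= ~(1u << id);
}

/* Switch an entry back on, but only if a pointer was ever stored. */
void model_hooks_enable(ModelHooks *h, enum model_hook_id id)
{
    if (h->entry[id] != NULL)
        h->valid |= 1u << id;
}

/* Call the hook if it is valid; return 1 if it ran, 0 otherwise. */
int model_hooks_call(const ModelHooks *h, enum model_hook_id id, int *counter)
{
    if (!(h->valid & (1u << id)))
        return 0;
    h->entry[id](counter);
    return 1;
}

/* Example hook: counts how often it has been called. */
void model_count_hook(int *counter)
{
    ++*counter;
}
```

The model also shows why a dynamically allocated table has to be zeroed
first: C<model_hooks_call> trusts the C<valid> bits, so stale bits in
uninitialised memory would lead to a call through a garbage pointer.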

=head1 Examining internal data structures with the C functions

To aid debugging, the source file F contains a number of
functions which produce formatted output of internal data structures.

The most commonly used of these functions is C; it's used
for dumping SVs, AVs, HVs, and CVs. The C module calls
C to produce debugging output from Perl-space, so users of that
module should already be familiar with its format.

C can be used to dump an C structure or any of its
derivatives, and produces output similar to C; in fact,
C will dump the main root of the code being evaluated,
exactly like C<-Dx>.

Other useful functions are C, which turns a C into an
op tree, C which calls C on all the
subroutines in a package like so (thankfully, these are all xsubs, so
there is no op tree):

    (gdb) print Perl_dump_packsubs(PL_defstash)

    SUB attributes::bootstrap = (xsub 0x811fedc 0)

    SUB UNIVERSAL::can = (xsub 0x811f50c 0)

    SUB UNIVERSAL::isa = (xsub 0x811f304 0)

    SUB UNIVERSAL::VERSION = (xsub 0x811f7ac 0)

    SUB DynaLoader::boot_DynaLoader = (xsub 0x805b188 0)

and C, which dumps all the subroutines in the stash and
the op tree of the main root.

=head1 How multiple interpreters and concurrency are supported

=head2 Background and PERL_IMPLICIT_CONTEXT

The Perl interpreter can be regarded as a closed box: it has an API
for feeding it code or otherwise making it do things, but it also has
functions for its own use.  This smells a lot like an object, and
there are ways for you to build Perl so that you can have multiple
interpreters, with one interpreter represented either as a C structure,
or inside a thread-specific structure.  These structures contain all
the context, the state of that interpreter.

One macro controls the major Perl build flavor: MULTIPLICITY. The
MULTIPLICITY build has a C structure that packages all the interpreter
state. With multiplicity-enabled perls, PERL_IMPLICIT_CONTEXT is also
normally defined, and enables the support for passing in a "hidden" first
argument that represents all three data structures. MULTIPLICITY makes
multi-threaded perls possible (with the ithreads threading model, related
to the macro USE_ITHREADS).

Two other "encapsulation" macros are PERL_GLOBAL_STRUCT and
PERL_GLOBAL_STRUCT_PRIVATE (the latter turns on the former, and the
former turns on MULTIPLICITY).  PERL_GLOBAL_STRUCT causes all the
internal variables of Perl to be wrapped inside a single global struct,
struct perl_vars, accessible as (globals) &PL_Vars or PL_VarsPtr or
the function Perl_GetVars().  PERL_GLOBAL_STRUCT_PRIVATE goes
one step further: there is still a single struct (allocated in main()
either from heap or from stack) but there are no global data symbols
pointing to it.  In either case the global struct should be initialised
as the very first thing in main() using Perl_init_global_struct() and
correspondingly torn down after perl_free() using
Perl_free_global_struct(); please see F for usage details.  You may
also need to use C in your coding to "declare the global variables"
when you are using them.  dTHX does this for you automatically.

To see whether you have non-const data you can use a BSD-compatible C:

  nm libperl.a | grep -v ' [TURtr] '

If this displays any C or C symbols, you have non-const data.

For backward compatibility reasons defining just PERL_GLOBAL_STRUCT
doesn't actually hide all symbols inside a big global struct: some
PerlIO_xxx vtables are left visible.  The PERL_GLOBAL_STRUCT_PRIVATE
then hides everything (see how the PERLIO_FUNCS_DECL is used).

All this obviously requires a way for the Perl internal functions to be
either subroutines taking some kind of structure as the first
argument, or subroutines taking nothing as the first argument.  To
enable these two very different ways of building the interpreter,
the Perl source (as it does in so many other situations) makes heavy
use of macros and subroutine naming conventions.

First problem: deciding which functions will be public API functions and
which will be private.  All functions whose names begin with C are private
(think "S" for "secret" or "static").  All other functions begin with
"Perl_", but just because a function begins with "Perl_" does not mean it is
part of the API. (See L.) The easiest way to be B<sure> a
function is part of the API is to find its entry in L.
If it exists in L, it's part of the API.  If it doesn't, and you
think it should be (i.e., you need it for your extension), send mail via
L explaining why you think it should be.

Second problem: there must be a syntax so that the same subroutine
declarations and calls can pass a structure as their first argument,
or pass nothing.  To solve this, the subroutines are named and
declared in a particular way.  Here's a typical start of a static
function used within the Perl guts:

  STATIC void
  S_incline(pTHX_ char *s)

STATIC becomes "static" in C, and may be #define'd to nothing in some
configurations in the future.
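
The idea of a parameter-list macro that expands either to a context
parameter or to nothing can be tried out in isolation.  The following
is a minimal sketch in plain C, not the definitions Perl itself uses:
C<MODEL_IMPLICIT_CONTEXT>, C<ModelInterp>, C<pCTX_>, C<aCTX_> and
C<CTX_FIELD> are hypothetical names chosen for the illustration, and the
real macros handle more cases than this.

```c
/* Remove this define to build the flavour without a context argument. */
#define MODEL_IMPLICIT_CONTEXT

/* Hypothetical interpreter state: just a call counter. */
typedef struct { int calls; } ModelInterp;

#ifdef MODEL_IMPLICIT_CONTEXT
   /* Context travels as a hidden first parameter named my_ctx. */
#  define pCTX   ModelInterp *my_ctx   /* prototype, no other arguments */
#  define pCTX_  pCTX,                 /* prototype, arguments follow   */
#  define aCTX   my_ctx                /* call, no other arguments      */
#  define aCTX_  aCTX,                 /* call, arguments follow        */
#  define CTX_FIELD(name) (my_ctx->name)
#else
   /* State lives in one global; the macros expand to nothing. */
static ModelInterp model_global_interp;
#  define pCTX   void
#  define pCTX_
#  define aCTX
#  define aCTX_
#  define CTX_FIELD(name) (model_global_interp.name)
#endif

/* Private helper, following the "S_" naming convention for statics. */
static int S_model_add(pCTX_ int a, int b)
{
    CTX_FIELD(calls)++;
    return a + b;
}

/* Public function: forwards the context to the helper via aCTX_. */
int Model_sum3(pCTX_ int a, int b, int c)
{
    return S_model_add(aCTX_ S_model_add(aCTX_ a, b), c);
}

/* Short name that callers write; it hides the context argument. */
#define model_sum3(a, b, c) Model_sum3(aCTX_ a, b, c)
```

With the define in place, a caller needs a variable named C<my_ctx> in
scope and can then write C<model_sum3(1, 2, 3)>; the same call compiles
unchanged when the define is removed.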

A public function (i.e. part of the internal API, but not necessarily
sanctioned for use in extensions) begins like this:

  void
  Perl_sv_setiv(pTHX_ SV* dsv, IV num)

C is one of a number of macros (in F) that hide the
details of the interpreter's context.  THX stands for "thread", "this",
or "thingy", as the case may be.  (And no, George Lucas is not involved. :-)
The first character could be 'p' for a B<p>rototype, 'a' for
B<a>rgument, or 'd' for B<d>eclaration, so we have C, C and C, and
their variants.

When Perl is built without options that set PERL_IMPLICIT_CONTEXT, there
is no first argument containing the interpreter's context.  The trailing
underscore in the pTHX_ macro indicates that the macro expansion needs a
comma after the context argument because other arguments follow it.  If
PERL_IMPLICIT_CONTEXT is not defined, pTHX_ will be ignored, and the
subroutine is not prototyped to take the extra argument.  The form of
the macro without the trailing underscore is used when there are no
additional explicit arguments.

When a core function calls another, it must pass the context.  This is
normally hidden via macros.  Consider C.  It expands into something like
this:

    #ifdef PERL_IMPLICIT_CONTEXT
      #define sv_setiv(a,b)      Perl_sv_setiv(aTHX_ a, b)
      /* can't do this for vararg functions, see below */
    #else
      #define sv_setiv           Perl_sv_setiv
    #endif

This works well, and means that XS authors can gleefully write:

    sv_setiv(foo, bar);

and still have it work under all the modes Perl could have been
compiled with.

This doesn't work so cleanly for varargs functions, though, as macros
imply that the number of arguments is known in advance.  Instead we
either need to spell them out fully, passing C as the first argument
(the Perl core tends to do this with functions like Perl_warner), or use
a context-free version.

The context-free version of Perl_warner is called
Perl_warner_nocontext, and does not take the extra argument.  Instead
it does dTHX; to get the context from thread-local storage.  We
C<#define warner Perl_warner_nocontext> so that extensions get source
compatibility at the expense of performance.  (Passing an arg is
cheaper than grabbing it from thread-local storage.)

You can ignore [pad]THXx when browsing the Perl headers/sources.
Those are strictly for use within the core.  Extensions and embedders
need only be aware of [pad]THX.

=head2 So what happened to dTHR?

C was introduced in perl 5.005 to support the older thread model.  The
older thread model now uses the C mechanism to pass context pointers
around, so C is not useful any more.  Perl 5.6.0 and later still have it
for backward source compatibility, but it is defined to be a no-op.

=head2 How do I use all this in extensions?

When Perl is built with PERL_IMPLICIT_CONTEXT, extensions that call
any functions in the Perl API will need to pass the initial context
argument somehow.  The kicker is that you will need to write it in
such a way that the extension still compiles when Perl hasn't been
built with PERL_IMPLICIT_CONTEXT enabled.

There are three ways to do this.  First, the easy but inefficient way,
which is also the default, in order to maintain source compatibility
with extensions: whenever F is #included, it redefines the aTHX
and aTHX_ macros to call a function that will return the context.
Thus, something like:

    sv_setiv(sv, num);

in your extension will translate to this when PERL_IMPLICIT_CONTEXT is
in effect:

    Perl_sv_setiv(Perl_get_context(), sv, num);

or to this otherwise:

    Perl_sv_setiv(sv, num);

You don't have to do anything new in your extension to get this; since
the Perl library provides Perl_get_context(), it will all just
work.

The second, more efficient way is to use the following template for
your Foo.xs:

    #define PERL_NO_GET_CONTEXT     /* we want efficiency */
    #include "EXTERN.h"
    #include "perl.h"
    #include "XSUB.h"

    STATIC void my_private_function(int arg1, int arg2);

    STATIC void
    my_private_function(int arg1, int arg2)
    {
        dTHX;       /* fetch context */
        ... call many Perl API functions ...
    }

    [... etc ...]

    MODULE = Foo            PACKAGE = Foo

    /* typical XSUB */

    void
    my_xsub(arg)
            int arg
        CODE:
            my_private_function(arg, 10);

Note that the only two changes from the normal way of writing an
extension are the addition of a C<#define PERL_NO_GET_CONTEXT> before
including the Perl headers, followed by a C declaration at
the start of every function that will call the Perl API.
(You'll know which functions need this, because the C compiler will
complain that there's an undeclared identifier in those functions.)  No
changes are needed for the XSUBs themselves, because the XS() macro is
correctly defined to pass in the implicit context if needed.

The third, even more efficient way is to ape how it is done within
the Perl guts:

    #define PERL_NO_GET_CONTEXT     /* we want efficiency */
    #include "EXTERN.h"
    #include "perl.h"
    #include "XSUB.h"

    /* pTHX_ only needed for functions that call Perl API */
    STATIC void my_private_function(pTHX_ int arg1, int arg2);

    STATIC void
    my_private_function(pTHX_ int arg1, int arg2)
    {
        /* dTHX; not needed here, because THX is an argument */
        ... call Perl API functions ...
    }

    [... etc ...]

    MODULE = Foo            PACKAGE = Foo

    /* typical XSUB */

    void
    my_xsub(arg)
            int arg
        CODE:
            my_private_function(aTHX_ arg, 10);

This implementation never has to fetch the context using a function
call, since it is always passed as an extra argument.  Depending on
your needs for simplicity or efficiency, you may mix the previous
two approaches freely.

Never add a comma after C yourself--always use the form of the macro
with the underscore for functions that take explicit arguments, or the
form without the underscore for functions with no explicit arguments.

If one is compiling Perl with the C<-DPERL_GLOBAL_STRUCT> the C
definition is needed if the Perl global variables (see F or F) are
accessed in the function and C is not used (the C includes the C if
necessary).  One notices the need for C only with the said compile-time
define, because otherwise the Perl global variables are visible as-is.

=head2 Should I do anything special if I call perl from multiple threads?

If you create interpreters in one thread and then proceed to call them
in another, you need to make sure perl's own Thread Local Storage (TLS)
slot is initialized correctly in each of those threads.
The C and C API functions will automatically set the TLS slot to the
interpreter they created, so that there is no need to do anything
special if the interpreter is always accessed in the same thread that
created it, and that thread did not create or call any other
interpreters afterwards.  If that is not the case, you have to set the
TLS slot of the thread before calling any functions in the Perl API on
that particular interpreter.  This is done by calling the C macro in
that thread as the first thing you do:

    /* do this before doing anything else with some_perl */
    PERL_SET_CONTEXT(some_perl);

    ... other Perl API calls on some_perl go here ...

=head2 Future Plans and PERL_IMPLICIT_SYS

Just as PERL_IMPLICIT_CONTEXT provides a way to bundle up everything
that the interpreter knows about itself and pass it around, so too are
there plans to allow the interpreter to bundle up everything it knows
about the environment it's running on.  This is enabled with the
PERL_IMPLICIT_SYS macro.  Currently it only works with USE_ITHREADS on
Windows.

This allows the ability to provide an extra pointer (called the "host"
environment) for all the system calls.  This makes it possible for
all the system stuff to maintain their own state, broken down into
seven C structures.  These are thin wrappers around the usual system
calls (see F) for the default perl executable, but for a more
ambitious host (like the one that would do fork() emulation) all the
extra work needed to pretend that different interpreters are actually
different "processes", would be done here.

The Perl engine/interpreter and the host are orthogonal entities.
There could be one or more interpreters in a process, and one or
more "hosts", with free association between them.

=head1 Internal Functions

All of Perl's internal functions which will be exposed to the outside
world are prefixed by C so that they will not conflict with XS
functions or functions used in a program in which Perl is embedded.
Similarly, all global variables begin with C.  (By convention, static
functions start with C.)

Inside the Perl core (C defined), you can get at the functions either
with or without the C prefix, thanks to a bunch of defines that live in
F.  Note that extension code should I<not> set C; this exposes the full
perl internals, and is likely to cause breakage of the XS in each new
perl release.

The file F is generated automatically from F and F.  F also creates the
prototyping header files for the internal functions, generates the
documentation and a lot of other bits and pieces.  It's important that
when you add a new function to the core or change an existing one, you
change the data in the table in F as well.  Here's a sample entry from
that table:

    Apd |SV**   |av_fetch   |AV* ar|I32 key|I32 lval

The second column is the return type, the third column the name.
Columns after that are the arguments.  The first column is a set of
flags:

=over 3

=item A

This function is a part of the public API.  All such functions should
also have 'd', very few do not.

=item p

This function has a C prefix; i.e. it is defined as C.

=item d

This function has documentation using the C feature which we'll look at
in a second.  Some functions have 'd' but not 'A'; docs are good.

=back

Other available flags are:

=over 3

=item s

This is a static function and is defined as C, and usually called
within the sources as C.

=item n

This does not need an interpreter context, so the definition has no C,
and it follows that callers don't use C.  (See L.)

=item r

This function never returns; C, C and friends.

=item f

This function takes a variable number of arguments, C style.
The argument list should end with C<...>, like this:

    Afprd   |void   |croak          |const char* pat|...

=item M

This function is part of the experimental development API, and may
change or disappear without notice.

=item o

This function should not have a compatibility macro to define, say,
C to C.  It must be called as C.


=item x

This function isn't exported out of the Perl core.

=item m

This is implemented as a macro.

=item X

This function is explicitly exported.

=item E

This function is visible to extensions included in the Perl core.

=item b

Binary backward compatibility; this function is a macro but also has
a C implementation (which is exported).

=item others

See the comments at the top of C for others.

=back

If you edit F or F, you will need to run C to force a rebuild of F and
other auto-generated files.

=head2 Formatted Printing of IVs, UVs, and NVs

If you are printing IVs, UVs, or NVs, then instead of the stdio(3)
style formatting codes like C<%d>, C<%ld>, C<%f>, you should use the
following macros for portability:

    IVdf            IV in decimal
    UVuf            UV in decimal
    UVof            UV in octal
    UVxf            UV in hexadecimal
    NVef            NV %e-like
    NVff            NV %f-like
    NVgf            NV %g-like

These will take care of 64-bit integers and long doubles.
For example:

    printf("IV is %"IVdf"\n", iv);

The IVdf will expand to whatever is the correct format for the IVs.

If you are printing addresses of pointers, use UVxf combined
with PTR2UV(), do not use %lx or %p.

=head2 Pointer-To-Integer and Integer-To-Pointer

Because pointer size does not necessarily equal integer size,
use the following macros to do it right.

    PTR2UV(pointer)
    PTR2IV(pointer)
    PTR2NV(pointer)
    INT2PTR(pointertotype, integer)

For example:

    IV  iv = ...;
    SV *sv = INT2PTR(SV*, iv);

and

    AV *av = ...;
    UV  uv = PTR2UV(av);

=head2 Exception Handling

There are a couple of macros to do very basic exception handling in XS
modules.  You have to define C before including F to be able to use
these macros:

    #define NO_XSLOCKS
    #include "XSUB.h"

You can use these macros if you call code that may croak, but you need
to do some cleanup before giving control back to Perl.  For example:

    dXCPT;    /* set up necessary variables */

    XCPT_TRY_START {
      code_that_may_croak();
    } XCPT_TRY_END

    XCPT_CATCH
    {
      /* do cleanup here */
      XCPT_RETHROW;
    }

Note that you always have to rethrow an exception that has been
caught.
Using these macros, it is not possible to just catch the exception and
ignore it.  If you have to ignore the exception, you have to use the C
function.

The advantage of using the above macros is that you don't have to set
up an extra function for C, and that using these macros is faster than
using C.

=head2 Source Documentation

There's an effort going on to document the internal functions and
automatically produce reference manuals from them - L is one such
manual which details all the functions which are available to XS
writers.  L is the autogenerated manual for the functions which are not
part of the API and are supposedly for internal use only.

Source documentation is created by putting POD comments into the C
source, like this:

 /*
 =for apidoc sv_setiv

 Copies an integer into the given SV.  Does not handle 'set' magic.  See
 C.

 =cut
 */

Please try and supply some documentation if you add functions to the
Perl core.

=head2 Backwards compatibility

The Perl API changes over time.  New functions are added or the
interfaces of existing functions are changed.  The C module tries to
provide compatibility code for some of these changes, so XS writers
don't have to code it themselves when supporting multiple versions of
Perl.

C generates a C header file F that can also be run as a Perl script.
To generate F, run:

    perl -MDevel::PPPort -eDevel::PPPort::WriteFile

Besides checking existing XS code, the script can also be used to
retrieve compatibility information for various API calls using the
C<--api-info> command line switch.  For example:

    % perl ppport.h --api-info=sv_magicext

For details, see C.

=head1 Unicode Support

Perl 5.6.0 introduced Unicode support.  It's important for porters and
XS writers to understand this support and make sure that the code they
write does not corrupt Unicode data.

=head2 What B<is> Unicode, anyway?

In the olden, less enlightened times, we all used to use ASCII.  Most of
us did, anyway.  The big problem with ASCII is that it's American.
Well, no, that's not actually the problem; the problem is that it's not
particularly useful for people who don't use the Roman alphabet.  What
used to happen was that particular languages would stick their own
alphabet in the upper range of the sequence, between 128 and 255.  Of
course, we then ended up with plenty of variants that weren't quite
ASCII, and the whole point of it being a standard was lost.

Worse still, if you've got a language like Chinese or Japanese that has
hundreds or thousands of characters, then you really can't fit them
into a mere 256, so they had to forget about ASCII altogether, and
build their own systems using pairs of numbers to refer to one
character.

To fix this, some people formed Unicode, Inc. and produced a new
character set containing all the characters you can possibly think of
and more.  There are several ways of representing these characters, and
the one Perl uses is called UTF-8.  UTF-8 uses a variable number of
bytes to represent a character.  You can learn more about Unicode and
Perl's Unicode model in L.

=head2 How can I recognise a UTF-8 string?

You can't.  This is because UTF-8 data is stored in bytes just like
non-UTF-8 data.  The Unicode character 200, (C<0xC8> for you hex types)
capital E with a grave accent, is represented by the two bytes C.
Unfortunately, the non-Unicode string C has that byte sequence as well.
So you can't tell just by looking - this is what makes Unicode input an
interesting problem.

In general, you either have to know what you're dealing with, or you
have to guess.  The API function C can help; it'll tell you if a string
contains only valid UTF-8 characters.  However, it can't do the work
for you.  On a character-by-character basis, C will tell you whether
the current character in a string is valid UTF-8.

=head2 How does UTF-8 represent Unicode characters?

As mentioned above, UTF-8 uses a variable number of bytes to store a
character.
Characters with values 0...127 are stored in one byte, just like good
ol' ASCII.  Character 128 is stored as the two bytes C<0xC2 0x80>; this
continues up to character 191, which is C<0xC2 0xBF>.  Now we've run out
of bits in the second byte (191 is binary C<10111111>) so we move on;
192 is C<0xC3 0x80>.  And so it goes on, moving to three bytes at
character 2048.

Assuming you know you're dealing with a UTF-8 string, you can find out
how long the first character in it is with the C<UTF8SKIP> macro:

    char *utf = "\305\233\340\240\201";
    I32 len;

    len = UTF8SKIP(utf); /* len is 2 here */
    utf += len;
    len = UTF8SKIP(utf); /* len is 3 here */

Another way to skip over characters in a UTF-8 string is to use
C<utf8_hop>, which takes a string and a number of characters to skip
over.  You're on your own about bounds checking, though, so don't use it
lightly.

All bytes in a multi-byte UTF-8 character will have the high bit set,
so you can test whether you need to do something special with this
character like this (C<UTF8_IS_INVARIANT()> is a macro that tests
whether the byte can be encoded as a single byte even in UTF-8):

    U8 *utf;
    U8 *utf_end; /* 1 beyond buffer pointed to by utf */
    UV uv;       /* Note: a UV, not a U8, not a char */
    STRLEN len;  /* length of character in bytes */

    if (!UTF8_IS_INVARIANT(*utf))
        /* Must treat this as UTF-8 */
        uv = utf8_to_uvchr_buf(utf, utf_end, &len);
    else
        /* OK to treat this character as a byte */
        uv = *utf;

You can also see in that example that we use C<utf8_to_uvchr_buf> to get
the value of the character; the inverse function C<uvchr_to_utf8> is
available for putting a UV into UTF-8:

    if (!UTF8_IS_INVARIANT(uv))
        /* Must treat this as UTF8 */
        utf8 = uvchr_to_utf8(utf8, uv);
    else
        /* OK to treat this character as a byte */
        *utf8++ = uv;

You B<must> convert characters to UVs using the above functions if
you're ever in a situation where you have to match UTF-8 and non-UTF-8
characters.  You may not skip over UTF-8 characters in this case.
If you do this, you'll lose the ability to match hi-bit non-UTF-8
characters; for instance, if your UTF-8 string contains the two bytes
C<0xC3 0x88>, and you skip that character, you can never match a
C<chr(200)> in a non-UTF-8 string.  So don't do that!

=head2 How does Perl store UTF-8 strings?

Currently, Perl deals with Unicode strings and non-Unicode strings
slightly differently.  A flag in the SV, C<SVf_UTF8>, indicates that the
string is internally encoded as UTF-8.  Without it, the byte value is
the codepoint number and vice versa (in other words, the string is
encoded as iso-8859-1, but C<use feature 'unicode_strings'> is needed to
get iso-8859-1 semantics).  You can check and manipulate this flag with
the following macros:

    SvUTF8(sv)
    SvUTF8_on(sv)
    SvUTF8_off(sv)

This flag has an important effect on Perl's treatment of the string: if
Unicode data is not properly distinguished, regular expressions,
C<length>, C<substr> and other string handling operations will have
undesirable results.

The problem comes when you have, for instance, a string that isn't
flagged as UTF-8, and contains a byte sequence that could be UTF-8,
especially when combining non-UTF-8 and UTF-8 strings.

Never forget that the C<UTF8> flag is separate from the PV value; you
need to be sure you don't accidentally knock it off while you're
manipulating SVs.  More specifically, you cannot expect to do this:

    SV *sv;
    SV *nsv;
    STRLEN len;
    char *p;

    p = SvPV(sv, len);
    frobnicate(p);
    nsv = newSVpvn(p, len);

The C<char*> string does not tell you the whole story, and you can't
copy or reconstruct an SV just by copying the string value.  Check if
the old SV has the UTF8 flag set, and act accordingly:

    p = SvPV(sv, len);
    frobnicate(p);
    nsv = newSVpvn(p, len);
    if (SvUTF8(sv))
        SvUTF8_on(nsv);

In fact, your C<frobnicate> function should be made aware of whether or
not it's dealing with UTF-8 data, so that it can handle the string
appropriately.

Since passing an SV to an XS function and copying its data is not enough
to copy the UTF8 flag, passing just a C<char *> to an XS function is
even less correct.
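
To make the point about the flag concrete, here is a minimal standalone
sketch that deliberately does not use the Perl API.  The
C<struct flagged_str> type and the C<char_count> function are
hypothetical names invented for this illustration; they only mimic the
idea of a byte buffer that travels together with a separate "is UTF-8"
flag, and they assume well-formed UTF-8 whenever the flag is set.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an SV: the bytes plus a separate flag.
 * None of these names are part of the Perl API. */
struct flagged_str {
    const unsigned char *bytes; /* plays the role of the PV        */
    size_t len;                 /* length in bytes, not characters */
    int is_utf8;                /* plays the role of the UTF8 flag */
};

/* Count characters rather than bytes.
 * Flag clear: every byte is one character.
 * Flag set:   continuation bytes (bit pattern 10xxxxxx) are skipped,
 *             so each multi-byte sequence counts once.
 * Assumes the buffer holds well-formed UTF-8 when the flag is set. */
static size_t char_count(const struct flagged_str *s)
{
    size_t i;
    size_t n = 0;

    if (!s->is_utf8)
        return s->len;

    for (i = 0; i < s->len; i++) {
        if ((s->bytes[i] & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

Given the two bytes 0xC3 0x88, C<char_count> reports two characters
when the flag is clear and one character when it is set.  The bytes are
identical in both cases; only the flag says which reading is intended,
which is why it has to be carried along with the string.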
=head2 How do I convert a string to UTF-8?

If you're mixing UTF-8 and non-UTF-8 strings, it is necessary to upgrade
one of the strings to UTF-8.  If you've got an SV, the easiest way to do
this is:

    sv_utf8_upgrade(sv);

However, you must not do this, for example:

    if (!SvUTF8(left))
        sv_utf8_upgrade(left);

If you do this in a binary operator, you will actually change one of the
strings that came into the operator, and, while it shouldn't be
noticeable by the end user, it can cause problems in deficient code.

Instead, C<bytes_to_utf8> will give you a UTF-8-encoded B<copy> of its
string argument.  This is useful for having the data available for
comparisons and so on, without harming the original SV.  There's also
C<utf8_to_bytes> to go the other way, but naturally, this will fail if
the string contains any characters above 255 that can't be represented
in a single byte.

=head2 Is there anything else I need to know?

Not really.  Just remember these things:

=over 3

=item *

There's no way to tell if a string is UTF-8 or not.  You can tell if an
SV is UTF-8 by looking at its C<SvUTF8> flag.  Don't forget to set the
flag if something should be UTF-8.  Treat the flag as part of the PV,
even though it's not: if you pass on the PV to somewhere, pass on the
flag too.

=item *

If a string is UTF-8, B<always> use C<utf8_to_uvchr_buf> to get at the
value, unless C<UTF8_IS_INVARIANT(*s)> in which case you can use C<*s>.

=item *

When writing a character C<uv> to a UTF-8 string, B<always> use
C<uvchr_to_utf8>, unless C<UTF8_IS_INVARIANT(uv)> in which case you can
use C<*s = uv>.

=item *

Mixing UTF-8 and non-UTF-8 strings is tricky.  Use C<bytes_to_utf8> to
get a new string which is UTF-8 encoded, and then combine them.

=back

=head1 Custom Operators

Custom operator support is a new experimental feature that allows you to
define your own ops.  This is primarily to allow the building of
interpreters for other languages in the Perl core, but it also allows
optimizations through the creation of "macro-ops" (ops which perform the
functions of multiple ops which are usually executed together, such as
C<gvsv, gvsv, add>.)

This feature is implemented as a new op type, C<OP_CUSTOM>.
The Perl core does not "know" anything special about this op type, and
so it will not be involved in any optimizations.  This also means that
you can define your custom ops to be any op structure you like: unary,
binary, list and so on.

It's important to know what custom operators won't do for you.  They
won't let you add new syntax to Perl, directly.  They won't even let you
add new keywords, directly.  In fact, they won't change the way Perl
compiles a program at all.  You have to make those changes yourself,
after Perl has compiled the program.  You do this either by manipulating
the op tree using a C<CHECK> block and the C<B::Generate> module, or by
adding a custom peephole optimizer with the C<optimize> module.

When you do this, you replace ordinary Perl ops with custom ops by
creating ops with the type C<OP_CUSTOM> and the C<op_ppaddr> of your own
PP function.  This should be defined in XS code, and should look like
the PP ops in C<pp_*.c>.  You are responsible for ensuring that your op
takes the appropriate number of values from the stack, and you are
responsible for adding stack marks if necessary.

You should also "register" your op with the Perl interpreter so that it
can produce sensible error and warning messages.  Since it is possible
to have multiple custom ops within the one "logical" op type
C<OP_CUSTOM>, Perl uses the value of C<< o->op_ppaddr >> to determine
which custom op it is dealing with.  You should create an C<XOP>
structure for each ppaddr you use, set the properties of the custom op
with C<XopENTRY_set>, and register the structure against the ppaddr
using C<Perl_custom_op_register>.  A trivial example might look like:

    static XOP my_xop;
    static OP *my_pp(pTHX);

    BOOT:
        XopENTRY_set(&my_xop, xop_name, "myxop");
        XopENTRY_set(&my_xop, xop_desc, "Useless custom op");
        Perl_custom_op_register(aTHX_ my_pp, &my_xop);

The available fields in the structure are:

=over 4

=item xop_name

A short name for your op.  This will be included in some error messages,
and will also be returned as C<< $op->name >> by the L<B|B> module, so
it will appear in the output of modules like L<B::Concise|B::Concise>.
=item xop_desc

A short description of the function of the op.

=item xop_class

Which of the various C<*OP> structures this op uses.  This should be one
of the C<OA_*> constants from F<op.h>, namely

=over 4

=item OA_BASEOP

=item OA_UNOP

=item OA_BINOP

=item OA_LOGOP

=item OA_LISTOP

=item OA_PMOP

=item OA_SVOP

=item OA_PADOP

=item OA_PVOP_OR_SVOP

This should be interpreted as 'C<PVOP>' only.  The C<_OR_SVOP> is
because the only core C<PVOP>, C<OP_TRANS>, can sometimes be a C<SVOP>
instead.

=item OA_LOOP

=item OA_COP

=back

The other C<OA_*> constants should not be used.

=item xop_peep

This member is of type C<Perl_cpeep_t>, which expands to
C<void (*Perl_cpeep_t)(aTHX_ OP *o, OP *oldop)>.  If it is set, this
function will be called from C<Perl_rpeep> when ops of this type are
encountered by the peephole optimizer.  I<o> is the OP that needs
optimizing; I<oldop> is the previous OP optimized, whose C<op_next>
points to I<o>.

=back

C<B::Generate> directly supports the creation of custom ops by name.

=head1 AUTHORS

Until May 1997, this document was maintained by Jeff Okamoto
E<lt>okamoto@corp.hp.comE<gt>.  It is now maintained as part of Perl
itself by the Perl 5 Porters E<lt>perl5-porters@perl.orgE<gt>.

With lots of help and suggestions from Dean Roehrich, Malcolm Beattie,
Andreas Koenig, Paul Hudson, Ilya Zakharevich, Paul Marquess, Neil
Bowers, Matthew Green, Tim Bunce, Spider Boardman, Ulrich Pfeifer,
Stephen McCamant, and Gurusamy Sarathy.

=head1 SEE ALSO

L<perlapi>, L<perlintern>, L<perlxs>, L<perlembed>

# !!!!!!!   DO NOT EDIT THIS FILE   !!!!!!!
# This file is autogenerated by buildtoc from all the other pods.
# Edit those files and run pod/buildtoc to effect changes.

=head1 NAME

perltoc - perl documentation table of contents

=head1 DESCRIPTION

This page provides a brief table of contents for the rest of the Perl
documentation set.  It is meant to be scanned quickly or grepped
through to locate the proper section you're looking for.
=head1 BASIC DOCUMENTATION =head2 perl - The Perl 5 language interpreter =over 4 =item SYNOPSIS =item GETTING HELP =over 4 =item Overview =item Tutorials =item Reference Manual =item Internals and C Language Interface =item Miscellaneous =item Language-Specific =item Platform-Specific =item Stubs for Deleted Documents =back =item DESCRIPTION =item AVAILABILITY =item ENVIRONMENT =item AUTHOR =item FILES =item SEE ALSO =item DIAGNOSTICS =item BUGS =item NOTES =back =head2 perlintro -- a brief introduction and overview of Perl =over 4 =item DESCRIPTION =over 4 =item What is Perl? =item Running Perl programs =item Safety net =item Basic syntax overview =item Perl variable types Scalars, Arrays, Hashes =item Variable scoping =item Conditional and looping constructs if, while, for, foreach =item Builtin operators and functions Arithmetic, Numeric comparison, String comparison, Boolean logic, Miscellaneous =item Files and I/O =item Regular expressions Simple matching, Simple substitution, More complex regular expressions, Parentheses for capturing, Other regexp features =item Writing subroutines =item OO Perl =item Using Perl modules =back =item AUTHOR =back =head2 perlreftut - Mark's very short tutorial about references =over 4 =item DESCRIPTION =item Who Needs Complicated Data Structures? 
=item The Solution =item Syntax =over 4 =item Making References =item Using References =item An Example =item Arrow Rule =back =item Solution =item The Rest =item Summary =item Credits =over 4 =item Distribution Conditions =back =back =head2 perldsc - Perl Data Structures Cookbook =over 4 =item DESCRIPTION arrays of arrays, hashes of arrays, arrays of hashes, hashes of hashes, more elaborate constructs =item REFERENCES X X X X =item COMMON MISTAKES =item CAVEAT ON PRECEDENCE X X =item WHY YOU SHOULD ALWAYS C =item DEBUGGING X X X X X X X X X X =item CODE EXAMPLES =item ARRAYS OF ARRAYS X X =over 4 =item Declaration of an ARRAY OF ARRAYS =item Generation of an ARRAY OF ARRAYS =item Access and Printing of an ARRAY OF ARRAYS =back =item HASHES OF ARRAYS X X =over 4 =item Declaration of a HASH OF ARRAYS =item Generation of a HASH OF ARRAYS =item Access and Printing of a HASH OF ARRAYS =back =item ARRAYS OF HASHES X X =over 4 =item Declaration of an ARRAY OF HASHES =item Generation of an ARRAY OF HASHES =item Access and Printing of an ARRAY OF HASHES =back =item HASHES OF HASHES X X =over 4 =item Declaration of a HASH OF HASHES =item Generation of a HASH OF HASHES =item Access and Printing of a HASH OF HASHES =back =item MORE ELABORATE RECORDS X X X =over 4 =item Declaration of MORE ELABORATE RECORDS =item Declaration of a HASH OF COMPLEX RECORDS =item Generation of a HASH OF COMPLEX RECORDS =back =item Database Ties =item SEE ALSO =item AUTHOR =back =head2 perllol - Manipulating Arrays of Arrays in Perl =over 4 =item DESCRIPTION =over 4 =item Declaration and Access of Arrays of Arrays =item Growing Your Own =item Access and Printing =item Slices =back =item SEE ALSO =item AUTHOR =back =head2 perlrequick - Perl regular expressions quick start =over 4 =item DESCRIPTION =item The Guide =over 4 =item Simple word matching =item Using character classes =item Matching this or that =item Grouping things and hierarchical matching =item Extracting matches =item Matching 
repetitions =item More matching =item Search and replace =item The split operator =back =item BUGS =item SEE ALSO =item AUTHOR AND COPYRIGHT =over 4 =item Acknowledgments =back =back =head2 perlretut - Perl regular expressions tutorial =over 4 =item DESCRIPTION =item Part 1: The basics =over 4 =item Simple word matching =item Using character classes =item Matching this or that =item Grouping things and hierarchical matching =item Extracting matches =item Backreferences =item Relative backreferences =item Named backreferences =item Alternative capture group numbering =item Position information =item Non-capturing groupings =item Matching repetitions =item Possessive quantifiers =item Building a regexp =item Using regular expressions in Perl =back =item Part 2: Power tools =over 4 =item More on characters, strings, and character classes =item Compiling and saving regular expressions =item Composing regular expressions at runtime =item Embedding comments and modifiers in a regular expression =item Looking ahead and looking behind =item Using independent subexpressions to prevent backtracking =item Conditional expressions =item Defining named patterns =item Recursive patterns =item A bit of magic: executing Perl code in a regular expression =item Backtracking control verbs =item Pragmas and debugging =back =item BUGS =item SEE ALSO =item AUTHOR AND COPYRIGHT =over 4 =item Acknowledgments =back =back =head2 perlootut - Object-Oriented Programming in Perl Tutorial =over 4 =item DATE =item DESCRIPTION =item OBJECT-ORIENTED FUNDAMENTALS =over 4 =item Object =item Class =item Methods =item Attributes =item Polymorphism =item Inheritance =item Encapsulation =item Composition =item Roles =item When to Use OO =back =item PERL OO SYSTEMS =over 4 =item Moose Declarative sugar, Roles built-in, A miniature type system, Full introspection and manipulation, Self-hosted and extensible, Rich ecosystem, Many more features =item Class::Accessor =item Object::Tiny =item Role::Tiny =item 
OO System Summary L, L, L, L =item Other OO Systems =back =item CONCLUSION =back =head2 perlperf - Perl Performance and Optimization Techniques =over 4 =item DESCRIPTION =item OVERVIEW =over 4 =item ONE STEP SIDEWAYS =item ONE STEP FORWARD =item ANOTHER STEP SIDEWAYS =back =item GENERAL GUIDELINES =item BENCHMARKS =over 4 =item Assigning and Dereferencing Variables. =item Search and replace or tr =back =item PROFILING TOOLS =over 4 =item Devel::DProf =item Devel::Profiler =item Devel::SmallProf =item Devel::FastProf =item Devel::NYTProf =back =item SORTING Elapsed Real Time, User CPU Time, System CPU Time =item LOGGING =over 4 =item Logging if DEBUG (constant) =back =item POSTSCRIPT =item SEE ALSO =over 4 =item PERLDOCS =item MAN PAGES =item MODULES =item URLS =back =item AUTHOR =back =head2 perlstyle - Perl style guide =over 4 =item DESCRIPTION =back =head2 perlcheat - Perl 5 Cheat Sheet =over 4 =item DESCRIPTION =over 4 =item The sheet =back =item ACKNOWLEDGEMENTS =item AUTHOR =item SEE ALSO =back =head2 perltrap - Perl traps for the unwary =over 4 =item DESCRIPTION =over 4 =item Awk Traps =item C/C++ Traps =item Sed Traps =item Shell Traps =item Perl Traps =item Perl4 to Perl5 Traps Discontinuance, Deprecation, and BugFix traps, Parsing Traps, Numerical Traps, General data type traps, Context Traps - scalar, list contexts, Precedence Traps, General Regular Expression Traps using s///, etc, Subroutine, Signal, Sorting Traps, OS Traps, DBM Traps, Unclassified Traps =item Discontinuance, Deprecation, and BugFix traps Symbols starting with "_" no longer forced into main, Double-colon valid package separator in variable name, 2nd and 3rd args to C are now in scalar context, Can't do C into a block that is optimized away, Can't use whitespace as variable name or quote delimiter, C gone, C<**> binds tighter than unary minus, C changed when iterating over a list, C with no args behavior changed, B<-e> behavior fixed, C returns number of elements in resulting list, Some 
error messages differ, C honors subroutine args, Bugs removed =item Parsing Traps Space between . and = triggers syntax error, Better parsing in perl 5, Function parsing, String interpolation of C<$#array> differs, Perl guesses on C, C followed by C<{> if it starts BLOCK or hash ref =item Numerical Traps Formatted output and significant digits, Auto-increment operator over signed int limit deleted, Assignment of return values from numeric equality tests doesn't work, Bitwise string ops =item General data type traps Negative array subscripts now count from the end of array, Setting C<$#array> lower now discards array elements, Hashes get defined before use, Glob assignment from localized variable to variable, Assigning C to glob, Changes in unary negation (of strings), Modifying of constants prohibited, C behavior changed, Variable Suicide =item Context Traps - scalar, list contexts Elements of argument lists for formats evaluated in list context, C returns false value in scalar context if no caller present, Comma operator in scalar context gives scalar context to args, C prototyped as C<($;@)> =item Precedence Traps LHS vs. RHS of any assignment operator, Semantic errors introduced due to precedence, Precedence of assignment operators same as the precedence of assignment, C requires parentheses around filehandle, C<$:> precedence over C<$::> gone, Precedence of file test operators documented, C, C, C are regular named unary operators =item General Regular Expression Traps using s///, etc. 
C interpolates on either side, C attaches its state to the searched string, C used within an anonymous sub, C<$+> isn't set to whole match, Substitution now returns null string if it fails, C is now a normal substitution, Stricter parsing of variables in regular expressions, C matches only once, Failed matches don't reset the match variables =item Subroutine, Signal, Sorting Traps Barewords that used to look like strings look like subroutine calls, Reverse is no longer allowed as the name of a sort subroutine, C won't let you specify a filehandle =item OS Traps SysV resets signal handler correctly, SysV C appends correctly =item Interpolation Traps C<@> always interpolates an array in double-quotish strings, Double-quoted strings may no longer end with an unescaped $, Arbitrary expressions are evaluated inside braces within double quotes, C<$$x> now tries to dereference $x, Creation of hashes on the fly with C requires protection, Bugs in earlier perl versions, Array and hash brackets during interpolation, Interpolation of C<\$$foo{bar}>, C string passed to C will not find string terminator =item DBM Traps Perl5 must have been linked with same dbm/ndbm as the default for C, DBM exceeding limit on the key/value size will cause perl5 to exit immediately =item Unclassified Traps C/C trap using returned value, C on empty string with LIMIT specified =back =back =head2 perldebtut - Perl debugging tutorial =over 4 =item DESCRIPTION =item use strict =item Looking at data and -w and v =item help =item Stepping through code =item Placeholder for a, w, t, T =item REGULAR EXPRESSIONS =item OUTPUT TIPS =item CGI =item GUIs =item SUMMARY =item SEE ALSO =item AUTHOR =item CONTRIBUTORS =back =head2 perlfaq - frequently asked questions about Perl =over 4 =item DESCRIPTION =over 4 =item Where to find the perlfaq =item How to use the perlfaq =item How to contribute to the perlfaq =item What if my question isn't answered in the FAQ? 
=back =item TABLE OF CONTENTS perlfaq1 - General Questions About Perl, perlfaq2 - Obtaining and Learning about Perl, perlfaq3 - Programming Tools, perlfaq4 - Data Manipulation, perlfaq5 - Files and Formats, perlfaq6 - Regular Expressions, perlfaq7 - General Perl Language Issues, perlfaq8 - System Interaction, perlfaq9 - Web, Email and Networking =item THE QUESTIONS =over 4 =item L: General Questions About Perl =item L: Obtaining and Learning about Perl =item L: Programming Tools =item L: Data Manipulation =item L: Files and Formats =item L: Regular Expressions =item L: General Perl Language Issues =item L: System Interaction =item L: Web, Email and Networking =back =item CREDITS =item AUTHOR AND COPYRIGHT =back =head2 perlfaq1 - General Questions About Perl =over 4 =item DESCRIPTION =over 4 =item What is Perl? =item Who supports Perl? Who develops it? Why is it free? =item Which version of Perl should I use? =item What are Perl 4, Perl 5, or Perl 6? =item What is Perl 6? =item How stable is Perl? =item Is Perl difficult to learn? =item How does Perl compare with other languages like Java, Python, REXX, Scheme, or Tcl? =item Can I do [task] in Perl? =item When shouldn't I program in Perl? =item What's the difference between "perl" and "Perl"? =item What is a JAPH? =item How can I convince others to use Perl? L, L =back =item AUTHOR AND COPYRIGHT =back =head2 perlfaq2 - Obtaining and Learning about Perl =over 4 =item DESCRIPTION =over 4 =item What machines support Perl? Where do I get it? =item How can I get a binary version of Perl? =item I don't have a C compiler. How can I build my own Perl interpreter? =item I copied the Perl binary from one machine to another, but scripts don't work. =item I grabbed the sources and tried to compile but gdbm/dynamic loading/malloc/linking/... failed. How do I make it work? =item What modules and extensions are available for Perl? What is CPAN? =item Where can I get information on Perl? L, L, L =item What is perl.com? 
Perl Mongers? pm.org? perl.org? cpan.org? L, L, L, L =item Where can I post questions? =item Perl Books =item Which magazines have Perl content? =item Which Perl blogs should I read? =item What mailing lists are there for Perl? =item Where can I buy a commercial version of Perl? =item Where do I send bug reports? =back =item AUTHOR AND COPYRIGHT =back =head2 perlfaq3 - Programming Tools =over 4 =item DESCRIPTION =over 4 =item How do I do (anything)? Basics, L - Perl data types, L - Perl pre-defined variables, L - Perl syntax, L - Perl operators and precedence, L - Perl subroutines, Execution, L - how to execute the Perl interpreter, L - Perl debugging, Functions, L - Perl builtin functions, Objects, L - Perl references and nested data structures, L - Perl modules (packages and symbol tables), L - Perl objects, L - how to hide an object class in a simple variable, Data Structures, L - Perl references and nested data structures, L - Manipulating arrays of arrays in Perl, L - Perl Data Structures Cookbook, Modules, L - Perl modules (packages and symbol tables), L - constructing new Perl modules and finding existing ones, Regexes, L - Perl regular expressions, L - Perl builtin functions>, L - Perl operators and precedence, L - Perl locale handling (internationalization and localization), Moving to perl5, L - Perl traps for the unwary, L, Linking with C, L - Tutorial for writing XSUBs, L - XS language reference manual, L - Perl calling conventions from C, L - Introduction to the Perl API, L - how to embed perl in your C program, Various =item How can I use Perl interactively? =item How do I find which modules are installed on my system? =item How do I debug my Perl programs? =item How do I profile my Perl programs? =item How do I cross-reference my Perl programs? =item Is there a pretty-printer (formatter) for Perl? =item Is there an IDE or Windows Perl Editor? 
Eclipse, Enginsite, Komodo, Notepad++, Open Perl IDE, OptiPerl, Padre, PerlBuilder, visiPerl+, Visual Perl, Zeus, GNU Emacs, MicroEMACS, XEmacs, Jed, Vim, Vile, Codewright, MultiEdit, SlickEdit, ConTEXT, Bash, Ksh, Tcsh, Zsh, Affrus, Alpha, BBEdit and BBEdit Lite =item Where can I get Perl macros for vi? =item Where can I get perl-mode or cperl-mode for emacs? X =item How can I use curses with Perl? =item How can I write a GUI (X, Tk, Gtk, etc.) in Perl? X X X X X X X X Tk, Wx, Gtk and Gtk2, Win32::GUI, CamelBones, Qt, Athena =item How can I make my Perl program run faster? =item How can I make my Perl program take less memory? Don't slurp!, Use map and grep selectively, Avoid unnecessary quotes and stringification, Pass by reference, Tie large variables to disk =item Is it safe to return a reference to local or lexical data? =item How can I free an array or hash so my program shrinks? =item How can I make my CGI script more efficient? =item How can I hide the source for my Perl program? =item How can I compile my Perl program into byte code or C? =item How can I get C<#!perl> to work on [MS-DOS,NT,...]? =item Can I write useful Perl programs on the command line? =item Why don't Perl one-liners work on my DOS/Mac/VMS system? =item Where can I learn about CGI or Web programming in Perl? =item Where can I learn about object-oriented Perl programming? =item Where can I learn about linking C with Perl? =item I've read perlembed, perlguts, etc., but I can't embed perl in my C program; what am I doing wrong? =item When I tried to run my script, I got this message. What does it mean? =item What's MakeMaker? =back =item AUTHOR AND COPYRIGHT =back =head2 perlfaq4 - Data Manipulation =over 4 =item DESCRIPTION =item Data: Numbers =over 4 =item Why am I getting long decimals (eg, 19.9499999999999) instead of the numbers I should be getting (eg, 19.95)? =item Why is int() broken? =item Why isn't my octal data interpreted correctly? =item Does Perl have a round() function? 
What about ceil() and floor()? Trig functions? =item How do I convert between numeric representations/bases/radixes? How do I convert hexadecimal into decimal, How do I convert from decimal to hexadecimal, How do I convert from octal to decimal, How do I convert from decimal to octal, How do I convert from binary to decimal, How do I convert from decimal to binary =item Why doesn't & work the way I want it to? =item How do I multiply matrices? =item How do I perform an operation on a series of integers? =item How can I output Roman numerals? =item Why aren't my random numbers random? =item How do I get a random number between X and Y? =back =item Data: Dates =over 4 =item How do I find the day or week of the year? =item How do I find the current century or millennium? =item How can I compare two dates and find the difference? =item How can I take a string and turn it into epoch seconds? =item How can I find the Julian Day? =item How do I find yesterday's date? X X X X X X X X X X =item Does Perl have a Year 2000 or 2038 problem? Is Perl Y2K compliant? =back =item Data: Strings =over 4 =item How do I validate input? =item How do I unescape a string? =item How do I remove consecutive pairs of characters? =item How do I expand function calls in a string? =item How do I find matching/nesting anything? =item How do I reverse a string? =item How do I expand tabs in a string? =item How do I reformat a paragraph? =item How can I access or change N characters of a string? =item How do I change the Nth occurrence of something? =item How can I count the number of occurrences of a substring within a string? =item How do I capitalize all the words on one line? X X X X =item How can I split a [character]-delimited string except when inside [character]? =item How do I strip blank space from the beginning/end of a string? =item How do I pad a string with blanks or pad a number with zeroes? =item How do I extract selected columns from a string? 
=item How do I find the soundex value of a string?

=item How can I expand variables in text strings?

=item What's wrong with always quoting "$vars"?

=item Why don't my E<lt>E<lt>HERE documents work?

There must be no space after the E<lt>E<lt> part, There (probably)
should be a semicolon at the end of the opening token, You can't
(easily) have any space in front of the tag, There needs to be at least
a line separator after the end token

=back

=item Data: Arrays

=over 4

=item What is the difference between a list and an array?

=item What is the difference between $array[1] and @array[1]?

=item How can I remove duplicate elements from a list or array?

=item How can I tell whether a certain element is contained in a list or
array?

=item How do I compute the difference of two arrays? How do I compute
the intersection of two arrays?

=item How do I test whether two arrays or hashes are equal?

=item How do I find the first array element for which a condition is
true?

=item How do I handle linked lists?

=item How do I handle circular lists?

=item How do I shuffle an array randomly?

=item How do I process/modify each element of an array?

=item How do I select a random element from an array?

=item How do I permute N elements of a list?

=item How do I sort an array by (anything)?

=item How do I manipulate arrays of bits?

=item Why does defined() return true on empty arrays and hashes?

=back

=item Data: Hashes (Associative Arrays)

=over 4

=item How do I process an entire hash?

=item How do I merge two hashes?

=item What happens if I add or remove keys from a hash while iterating
over it?

=item How do I look up a hash element by value?

=item How can I know how many entries are in a hash?

=item How do I sort a hash (optionally by value instead of key)?

=item How can I always keep my hash sorted?

=item What's the difference between "delete" and "undef" with hashes?

=item Why don't my tied hashes make the defined/exists distinction?
=item How do I reset an each() operation part-way through? =item How can I get the unique keys from two hashes? =item How can I store a multidimensional array in a DBM file? =item How can I make my hash remember the order I put elements into it? =item Why does passing a subroutine an undefined element in a hash create it? =item How can I make the Perl equivalent of a C structure/C++ class/hash or array of hashes or arrays? =item How can I use a reference as a hash key? =item How can I check if a key exists in a multilevel hash? =item How can I prevent addition of unwanted keys into a hash? =back =item Data: Misc =over 4 =item How do I handle binary data correctly? =item How do I determine whether a scalar is a number/whole/integer/float? =item How do I keep persistent data across program calls? =item How do I print out or copy a recursive data structure? =item How do I define methods for every class/object? =item How do I verify a credit card checksum? =item How do I pack arrays of doubles or floats for XS code? =back =item AUTHOR AND COPYRIGHT =back =head2 perlfaq5 - Files and Formats =over 4 =item DESCRIPTION =over 4 =item How do I flush/unbuffer an output filehandle? Why must I do this? X X X X =item How do I change, delete, or insert a line in a file, or append to the beginning of a file? X =item How do I count the number of lines in a file? X X X =item How do I delete the last N lines from a file? X X =item How can I use Perl's C<-i> option from within a program? X<-i> X =item How can I copy a file? X X X =item How do I make a temporary file name? X =item How can I manipulate fixed-record-length files? X X =item How can I make a filehandle local to a subroutine? How do I pass filehandles between subroutines? How do I make an array of filehandles? X X X =item How can I use a filehandle indirectly? X =item How can I set up a footer format to be used with write()? X