| Language | Overloaded string/bytes type | Unambiguous array-of-bytes type | Unambiguous textual-string type |
|---|---|---|---|
| C++ 11 | std::string | std::vector<byte> | – |
| C# | – | byte[] | string |
| Java | – | byte[] | String |
| Perl 5 | | – | – |
| PHP 5 | string | – | – |
| Python <= 2.5 | str | – | unicode |
| Python >= 2.6 | str | bytearray (4) | unicode |
| Python 3 | – | bytes, bytearray | str |
| Ruby 1.8 | String | – | – (5) |
| Ruby 1.9 | – | – | String |

C++ 11
Since we use a box type, Variant, in C++, any difficulty interpreting strings is easily handled by qualifying the value container. It doesn't seem like too much of a stretch to me to also take this approach with bindings for languages that don't offer disambiguated types.
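To make the "qualify the value container" idea concrete for a language without disambiguated types, here is a minimal sketch in Python. The names Binary, Text, and wire_type are hypothetical and exist only for illustration; they are not part of any existing binding. The point is only that the wrapper, not the underlying string type, carries the developer's intention.

```python
class Binary(object):
    """Hypothetical wrapper: the payload is an opaque array of bytes."""
    def __init__(self, data):
        self.data = data


class Text(object):
    """Hypothetical wrapper: the payload is a textual string."""
    def __init__(self, data):
        self.data = data


def wire_type(value):
    """Pick the kind to encode from the wrapper alone, never from the payload."""
    if isinstance(value, Binary):
        return "binary"
    if isinstance(value, Text):
        return "string"
    # An unwrapped value gives no signal, so refuse to guess.
    raise TypeError("ambiguous value; wrap it in Binary or Text")
```

A binding built this way would reject bare strings rather than guess, which costs the developer a wrapper call but removes the ambiguity entirely.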
C++ 11 doesn't seem to have a dedicated type for Unicode text. It has wide characters (not the same thing), and it has literal syntax for Unicode strings. These resolve, however, to arrays of char, char16_t, or char32_t, so there's no type signal we can easily use to figure out the developer's intention.
Perl 5
I know too little about Perl to say what's going on here. reftype from Scalar::Util seems to offer a way to get more type info. Perl has an array type, but its use for byte arrays doesn't appear to be recommended.
[jross@localhost ~]$ perl -e 'use Scalar::Util qw(reftype); my $foo = "hello"; print reftype(\$foo) . "\n"'
SCALAR
[jross@localhost ~]$ perl -e 'use Scalar::Util qw(reftype); my $foo = "hello"; print reftype([]) . "\n"'
ARRAY
PHP 5
PHP 5 has an array type that could hold bytes, but it's really a map with integer keys, which I would consider too inefficient for this application.
Python 2
Python 2's 'bytes' type is simply an alias for str, so we can't use it to disambiguate. Python >= 2.6 does, however, have bytearray, which I think would serve well enough.
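As a rough sketch of how bytearray could serve as the type signal, the function below treats bytearray as binary, the Unicode text type as a string, and everything else as ambiguous. The name classify and the three labels are assumptions made for illustration. It uses type(u"") so the same code means unicode on Python 2 and str on Python 3.

```python
def classify(value):
    """Guess the intended kind of a value from its Python type alone."""
    if isinstance(value, bytearray):
        # bytearray exists from Python 2.6 on and is never used for text.
        return "binary"
    if isinstance(value, type(u"")):
        # unicode on Python 2, str on Python 3.
        return "string"
    # Python 2's str (and its 'bytes' alias) lands here: no usable signal.
    return "ambiguous"
```

Under this scheme a Python 2 caller who wants to send binary data has to pass bytearray(data) explicitly; a plain str stays ambiguous and would need a default or an error.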
Ruby 1.8
Ruby <= 1.8 doesn't have explicit string encodings, and I can't tell what the default is.
Ruby >= 1.9
Ruby >= 1.9 at first seemed to offer everything we need, but it doesn't. It has strings with encodings; it doesn't have an explicit binary data type that is easily distinguished from text. One could use an Array of ints, but that's perhaps less efficient than we need.
irb(main):003:0> x = [1, 2, 3, 255]
=> [1, 2, 3, 255]
irb(main):004:0> x.class
=> Array
irb(main):016:0> x = "holla"
=> "holla"
irb(main):017:0> x.class
=> String
irb(main):018:0> x.encoding
=> #<Encoding:UTF-8>
Gordon Sim
What about Python 2.4?
Justin Ross
Older 2.x versions of Python have 'buffer', which looks like it could work.