Squared Programming Articles: utf 8


Thursday, December 19, 2013

STL-Style UTF-8 String Class now on GitHub

I've recently uploaded the files from the blog series "Writing a STL-Style UTF-8 String Class" to GitHub. For the latest updates, use the files there: Squared'D Programming UTF-8 String Class - https://github.com/squaredprogramming/sdp_utf8string

Some additions:

I'm now using the BSD License. This means the source code can be used without restrictions as long as you keep the copyright notice. The utf8string class is totally free.
When I was updating the source code of one of my projects to use the utf8string class, I noticed that I hadn't implemented the comparison operators, so they have now been added.

For the latest information, be sure to check back here. I do more than just program UTF-8 related stuff. I also develop indie video games. Here's a link to my game development blog - http://www.gamedev.net/blog/1670-squaredds-journal/

Sunday, December 15, 2013

Squared'D Programming Project Videos

I've been slowly trying to build up my video library on YouTube. Most of the videos focus on my main project Auxnet: Battlegrounds, but I'm also trying to add some instructional videos as well. Be sure to check them out and tell me what you think about them.

Introduction to UTF-8
This video gives an introduction to UTF-8 and Unicode. It gives a detailed description of UTF-8 and how to encode in UTF-8. This is a video presentation of the article "How about Unicode and UTF-8" which was published on www.gamedev.net.

Auxnet: Battlegrounds Intro (IGF 2014)
Auxnet is a futuristic, multi-player, sci-fi video game. In the future, the internet has been replaced by super networks where users can enter virtual worlds by connecting their consciousnesses to the network. They can experience many places and learn in a classroom from their own homes. Auxnet is one of those super networks. On Auxnet, users can play games in virtual arenas.

Auxnet: Battlegrounds - Pre-alpha Gameplay Preview
This is just a short video that shows a model moving around the scene with some simple AI-controlled characters. The game engine was written using DirectX by Dominque Douglas (Squared'D) and a lot of the core functionality has been completed. Most programming work is now centered on gameplay and AI. The game is still in its pre-alpha phase, so a lot of the art is still in its beginning stages and will undergo many upgrades before the final build. We hope this demo video shows some of our ability, but we expect much more and much better quality in the coming months.

Genesis SEED Intro (IGF 2013)

This is the intro of a past IGF entry that we worked on. It used a previous version of the current 3D engine. The future of this game is not as yet known, but this intro deserves to be preserved because of all the effort that went into it.

Game Engine Entity Attachments
Have you ever wondered how to attach weapons to a 3D model in C++ and have it animate with your model? In this video blog, I give details about how I've implemented attachments to the entities in my game engine. Attachments are separate 3D models that are basically add-ons to the base model. They can be used for weapons, accessories, etc. This video should be detailed enough for you to add attachments to your existing 3D model animation system.

Effects Test

This is a short little video showing the effects system that I have been developing.

Wednesday, December 11, 2013

Writing a STL-style UTF-8 String Class Part 5

Update: The latest code can now be found on GitHub. Check the post "STL-Style UTF-8 String Class now on GitHub" for more information.
____________________________________________________________________________

In this final part of my blog series "Writing a STL-style UTF-8 String Class", I will continue to make this string class behave more like std::string. First I added support for using different allocators. How did I go about doing this? I took a lesson from std::basic_string and did it using templates. This means the entire utf8string class is now a template class. This is the basic setup now:

namespace sd_utf8
{

template <class Alloc = std::allocator<_uchar8bit>>
class _utf8string
{
 public:
  // some types that we need to define to make this work like an stl object
  // internally this is an std string, but outwardly, it returns _char32bit
  typedef _char32bit   value_type;
  typedef _char32bit   *pointer;
  typedef const _char32bit *const_pointer;
  typedef _char32bit   &reference;
  typedef const _char32bit &const_reference;
  typedef size_t    size_type;
  typedef ptrdiff_t   difference_type;

  ...

  // make our iterator types here
  typedef utf8string_iterator<value_type>   iterator;
  typedef utf8string_iterator<const value_type> const_iterator;
  typedef value_reverse_iterator<iterator>  reverse_iterator;
  typedef value_reverse_iterator<const_iterator> const_reverse_iterator;

 private:
  std::basic_string<_uchar8bit, std::char_traits<unsigned char>, Alloc> utfstring_data;

 public:
  ...


};

typedef _utf8string<> utf8string;

}

I've added the code "typedef _utf8string<> utf8string;" so I'll still be able to use utf8string with the default template parameters. Using templates in this way makes adding an allocator very simple. Of course, this also means some changes were necessary in utf8utils.h:

// use a template method because the 16bit and 32 bit implementations are identical
// except for the type
template <typename char_type, typename Alloc>
inline void MakeUTF8StringImpl(const char_type* instring, std::basic_string<_uchar8bit, std::char_traits<unsigned char>, Alloc> &out, bool appendToOut)
{
...
}
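
To illustrate the allocator pattern described above, here is a minimal sketch. It is not the actual utf8string code. The names byte_string_t, byte_string, counting_allocator and g_allocation_count are hypothetical, and plain char is used instead of _uchar8bit to keep the example portable across standard libraries.

```cpp
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical counter, only here to show that the custom allocator is used.
static std::size_t g_allocation_count = 0;

// Hypothetical minimal allocator that counts its allocations and otherwise
// defers to std::allocator.
template <class T>
struct counting_allocator
{
    typedef T value_type;

    counting_allocator() {}

    template <class U>
    counting_allocator(const counting_allocator<U> &) {}

    T *allocate(std::size_t n)
    {
        ++g_allocation_count;
        return std::allocator<T>().allocate(n);
    }

    void deallocate(T *p, std::size_t n)
    {
        std::allocator<T>().deallocate(p, n);
    }
};

template <class T, class U>
bool operator==(const counting_allocator<T> &, const counting_allocator<U> &) { return true; }

template <class T, class U>
bool operator!=(const counting_allocator<T> &, const counting_allocator<U> &) { return false; }

// Hypothetical wrapper class: the allocator is a defaulted template parameter
// that is simply passed through to the internal std::basic_string.
template <class Alloc = std::allocator<char> >
class byte_string_t
{
public:
    void append(const char *s) { data_.append(s); }
    std::size_t size() const { return data_.size(); }

private:
    std::basic_string<char, std::char_traits<char>, Alloc> data_;
};

// Convenience name for the default-allocator version.
typedef byte_string_t<> byte_string;
```

Code that doesn't care about allocators keeps using byte_string, while byte_string_t<counting_allocator<char> > routes every allocation of the internal string through the custom allocator.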

Next I added all of the other std::string methods (functions), but not all of the overloads. I learned something working on this project: std::string has a ton of functions. Since the class uses a std::basic_string internally, I was able to use many of the STL functions to do the work. I just had to make sure the parameters were correct. replace() was a little trickier, but then I realized that a replace is just an erase followed by an insert.

_utf8string<Alloc>& replace (size_type pos, size_type len, const _utf8string<Alloc>& str)
{
 // make copy so exceptions won't change string
 _utf8string<Alloc> temp_copy(*this);

 // erase
 temp_copy.erase(pos, len);

 // insert
 temp_copy.insert(pos, str);

 assign(temp_copy);

 return *this;
}
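
The same erase-then-insert idea can be tried out on a plain std::string. This is only a sketch under that assumption: replace_via_copy is a hypothetical free function, and it commits with swap() rather than assign(), which is a small variation on the code above.

```cpp
#include <cstddef>
#include <string>

// Replaces len characters starting at pos with str. All the work is done on a
// copy, so target is left untouched if erase() or insert() throws (for
// example when pos is past the end of the string).
inline std::string &replace_via_copy(std::string &target, std::size_t pos,
                                     std::size_t len, const std::string &str)
{
    std::string temp_copy(target);

    temp_copy.erase(pos, len);   // step 1: remove the old range
    temp_copy.insert(pos, str);  // step 2: put the new text in its place

    target.swap(temp_copy);      // commit the result
    return target;
}
```

Calling replace_via_copy(s, 6, 5, "there") on "hello world" gives "hello there". If pos is out of range, the exception leaves s exactly as it was.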

Most of the function overloads that I didn't implement were ones that took iterators as parameters. Since my iterators are always constant and also since they don't have a reference to the actual utf8string object, adding those functions would have been difficult, but not impossible. I could have implemented them, but I'd have to re-scan the string each time and I thought that would be a bit wasteful.

This was a fun project for me. Here's a tip if you ever want to build something like this. There are so many functions that you need to try to reuse code as much as you can and find easy ways to do things.

What's next? Over the next few weeks, I'll continue to tinker with the code. My plan is to put this up on SourceForge and make it open source. I'll try to get that set up within the next two weeks, but in the meantime, I need to work on my main project Auxnet: Battlegrounds. Thanks for reading this long series. I hope you were able to benefit from it.

For the complete source code

utf8string.h
utf8utils.h

-------------
For the other posts in this series:

Writing a STL-Style UTF-8 String Class Part 1
Writing a STL-Style UTF-8 String Class Part 2
Writing a STL-Style UTF-8 String Class Part 3
Writing a STL-style UTF-8 String Class Part 4
Writing a STL-style UTF-8 String Class Part 5

Tuesday, December 10, 2013

Writing a STL-Style UTF-8 String Class Part 4

Update: The latest code can now be found on GitHub. Check the post "STL-Style UTF-8 String Class now on GitHub" for more information.
____________________________________________________________________________

In this fourth installment of "Writing a STL-Style UTF-8 String Class", I'll show the first version of the utf8string class and in my next post, I'll add all the other remaining member functions to make it behave like std::string.

One nice thing about UTF-8 is that it's compatible with normal 8-bit string functions. There is only one little problem. Processing UTF-8 requires using unsigned 8-bit characters, but some compilers have the char data type as a signed type. This means if we just cast the pointers, typical string functions will read our data that's over 127 as negative numbers. This is OK in most situations, but it will throw off less-than (<) and greater-than (>) comparison functions. To get around this and still be able to utilize some of C++'s STL string capabilities, I'm going to cut out some of my code. I will remove the following lines:

pointer  buffer;
size_type reserve_size;
size_type used_length;

And replace them with this

std::basic_string<_uchar8bit> utfstring_data;
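
To make the signed char problem mentioned above concrete, here is a small sketch. The helper names less_as_signed and less_as_unsigned are hypothetical, and the example assumes the usual two's-complement conversion, where a byte such as 0xC3 becomes a negative signed char.

```cpp
// Compares two bytes the way a string function built on signed char would.
inline bool less_as_signed(unsigned char a, unsigned char b)
{
    return static_cast<signed char>(a) < static_cast<signed char>(b);
}

// Compares two bytes as unsigned values, which matches UTF-8 byte order.
inline bool less_as_unsigned(unsigned char a, unsigned char b)
{
    return a < b;
}
```

With 0xC3 (the first byte of the UTF-8 encoding of U+00E9) and 'z', the signed comparison reports that 0xC3 sorts before 'z', while the unsigned comparison reports the opposite. Storing the data as unsigned 8-bit values avoids the first result.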

Also, the string will use _uchar8bit (unsigned char) internally, but externally, the class should now return _char32bit(unsigned int). This is because I want the class to return the decoded Unicode value, and not the UTF-8 encoding. This means our internal types will need to change.

typedef _char32bit   value_type;
typedef _char32bit   *pointer;
typedef const _char32bit *const_pointer;
typedef _char32bit   &reference;
typedef const _char32bit &const_reference;
typedef size_t    size_type;
typedef ptrdiff_t   difference_type;

This is significant in another way. This class can no longer return references to elements in the underlying array. It can only return values. If the class can only return values, then the iterator also should only return values. This means that our iterator will always be a const iterator that cannot change the string. This also means that we can no longer use the default std::reverse_iterator. That template class expects the iterator to return references and not values. If we try to use it, we'll get compiler errors. So we'll also need to write our own reverse iterator. Creating our own reverse iterator is not overly difficult. The key to making a reverse iterator is having it use the forward iterator as a member. Here's an example:

template <class TBaseIterator>
class value_reverse_iterator : public std::iterator<std::bidirectional_iterator_tag, value_type>
{
 public:
  TBaseIterator forward_iterator;

 public:
  // copy constructor
  value_reverse_iterator(const value_reverse_iterator &other)
   :forward_iterator(other.forward_iterator)
  {
  }

  // create from forward iterator
  value_reverse_iterator(const TBaseIterator &iterator)
   :forward_iterator(iterator)
  {
  }

  value_type operator*() const
  {
   TBaseIterator temp = forward_iterator;
   return *(--temp);
  }

  // does not check to see if it goes past the end
  // iterating past the end is undefined
  value_reverse_iterator &operator++()
  {
   --forward_iterator;

   return *this;
  }

  // does not check to see if it goes past the end
  // iterating past the end is undefined
  value_reverse_iterator operator++(int)
  {
   value_reverse_iterator copy(*this);

   // increment
   --forward_iterator;

   return copy;
  }

  // does not check to see if it goes past the end
  // iterating past begin is undefined
  value_reverse_iterator &operator--()
  {
   ++forward_iterator;
   return *this;
  }

  // does not check to see if it goes past the end
  // iterating past begin is undefined
  value_reverse_iterator operator--(int)
  {
   value_reverse_iterator copy(*this);

   ++forward_iterator;

   return copy;
  }

  bool operator == ( const value_reverse_iterator &other) const
  {
   // just compare pointers
   // the programmer is responsible for making sure both iterators are for the same set of data
   return forward_iterator == other.forward_iterator;
  }
 
  bool operator != (const value_reverse_iterator &other) const
  {
   // just compare pointers
   // the programmer is responsible for making sure both iterators are for the same set of data
   return forward_iterator != other.forward_iterator;
  }

  bool operator < ( const value_reverse_iterator &other) const
  {
   // just compare pointers
   // the programmer is responsible for making sure both iterators are for the same set of data
   return forward_iterator > other.forward_iterator;
  }
 
  bool operator > (const value_reverse_iterator &other) const
  {
   // just compare pointers
   // the programmer is responsible for making sure both iterators are for the same set of data
   return forward_iterator < other.forward_iterator;
  }
  
  bool operator <= ( const value_reverse_iterator &other) const
  {
   // just compare pointers
   // the programmer is responsible for making sure both iterators are for the same set of data
   return forward_iterator >= other.forward_iterator;
  }
 
  bool operator >= (const value_reverse_iterator &other) const
  {
   // just compare pointers
   // the programmer is responsible for making sure both iterators are for the same set of data
   return forward_iterator <= other.forward_iterator;
  }

};
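
The core trick in the class above is that dereferencing works on a copy of the forward iterator that has been stepped back by one. The sketch below isolates that idea using ordinary standard containers. The function names reverse_deref and collect_reversed are hypothetical and are not part of the utf8string code.

```cpp
#include <iterator>
#include <vector>

// Hypothetical helper: returns the value a reverse iterator refers to, given
// the forward iterator it wraps. The wrapped iterator sits one position past
// that element, so a copy is stepped back before it is dereferenced.
template <class BidirIt>
typename std::iterator_traits<BidirIt>::value_type reverse_deref(BidirIt it)
{
    --it;       // 'it' is a by-value copy, so the caller's iterator is untouched
    return *it; // returned by value, never by reference
}

// Walks [first, last) from back to front and collects the values.
// Moving the reverse position forward means moving the wrapped iterator back.
template <class BidirIt>
std::vector<typename std::iterator_traits<BidirIt>::value_type>
collect_reversed(BidirIt first, BidirIt last)
{
    std::vector<typename std::iterator_traits<BidirIt>::value_type> out;
    for (BidirIt cur = last; cur != first; --cur)
    {
        out.push_back(reverse_deref(cur));
    }
    return out;
}
```

Here last plays the role of the iterator wrapped by rbegin() and first the one wrapped by rend(), which is why the loop starts at last and stops when it reaches first.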

The next step is to use the UTF-8 utility functions that were made in Part 3 to make this work with UTF-8 encoded text. Also, all methods and operators that returned references before should now return values.

Many new constructors are needed because this should be able to automatically convert between std::string, std::wstring, and be able to use char and wchar_t type characters.

// default constructor
utf8string();

// build from a c string
// undefined (ie crashes) if str is NULL
utf8string(const _char8bit *str);

// build from a c string
// undefined (ie crashes) if str is NULL
utf8string(const _uchar8bit *str);

// construct from n copies of an unsigned char
utf8string(size_t n, _uchar8bit c);

// construct from n copies of a normal char
utf8string(size_t n, _char8bit c);

// construct from a single unsigned char
utf8string(_uchar8bit c);

// construct from a single normal char
utf8string(_char8bit c);

// construct from a single 16-bit character
utf8string(_char16bit c);

// construct from a single 32-bit character
utf8string(_char32bit c);

// copy constructor
utf8string(const utf8string &str);

/// \brief Constructs a UTF-8 string from a null-terminated 16-bit character string
utf8string (const _char16bit* instring_UCS2);

/// \brief Constructs a UTF-8 string from a null-terminated 32-bit character string
utf8string (const _char32bit* instring_UCS4);

/// \brief Constructs a UTF-8 string from a std::string
utf8string(const std::string &instring);

/// \brief Constructs a UTF-8 string from a std::wstring
utf8string(const std::wstring &instring);

I use the same pre-conditions as std::string. In other words, if std::string doesn't check the correctness of a parameter, I don't either. I've also written some extra constructors such as utf8string(char c). This is so I can do calls such as string += 'c' without having to overload the += operator with every type. If you provide constructors, the compiler will be able to construct a new utf8string and just use the += operator that takes "const utf8string &" as a parameter.
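
Here is a minimal sketch of that converting-constructor technique. The class tiny_string is hypothetical and only wraps a std::string. The point is that a single += overload serves several argument types.

```cpp
#include <string>

// Hypothetical miniature string class, used only to show the technique.
class tiny_string
{
public:
    tiny_string() {}
    tiny_string(const char *s) : data_(s) {}  // converting constructor
    tiny_string(char c) : data_(1, c) {}      // converting constructor

    // The only += overload. Other argument types are first converted to a
    // temporary tiny_string by one of the constructors above.
    tiny_string &operator+=(const tiny_string &other)
    {
        data_ += other.data_;
        return *this;
    }

    const std::string &str() const { return data_; }

private:
    std::string data_;
};
```

With this in place, both s += 'c' and s += "de" compile, even though only the "const tiny_string &" overload exists. The trade-off is a temporary object for each such call.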

I'm also providing three versions of the copy method:

// copies a sub string of this string to s and returns the number of characters copied
// if this string is shorter than len, as many characters as possible are copied
// undefined behavior if the buffer pointed to by s is not long enough
// UTF-8 version
size_type copy (_uchar8bit *s, size_type len, size_type pos = 0) const;

// copies a sub string of this string to s and returns the number of characters copied
// if this string is shorter than len, as many characters as possible are copied
// undefined behavior if the buffer pointed to by s is not long enough
// outputs to UCS-2
size_type copy (_char16bit *s, size_type len, size_type pos = 0) const;

// outputs to UCS-4
size_type copy (_char32bit *s, size_type len, size_type pos = 0) const;

For the complete source code so far, you can download these two files

utf8string.0.h
utf8utils.h

-------------
For the other posts in this series:

Writing a STL-Style UTF-8 String Class Part 1
Writing a STL-Style UTF-8 String Class Part 2
Writing a STL-Style UTF-8 String Class Part 3
Writing a STL-style UTF-8 String Class Part 4

Writing a STL-style UTF-8 String Class Part 5

Monday, December 9, 2013

Writing a STL-Style UTF-8 String Class Part 3

Update: The latest code can now be found on GitHub. Check the post "STL-Style UTF-8 String Class now on GitHub" for more information.
____________________________________________________________________________

So far I've written a basic string class and have given it an iterator, but there's something missing. It doesn't use UTF-8. In this post, I'll introduce some UTF-8 utility functions that the string class will use.

The following information requires a knowledge of UTF-8. If you'd like information on it, you can watch my video presentation that gives a good introduction to UTF-8 or check out the first post in this series for some links.

To add UTF-8 to this class, first I will define a new namespace to keep everything.

namespace sd_utf8
{
}

I'll also add another header with some UTF-8 utility functions. These functions will be used with the utf8string class, but application programmers will also be able to use the utility functions to add UTF-8 capabilities to their existing classes.

The functions will need this type, which is a 4-byte array. The encoded UTF-8 bytes will be put here.

typedef _uchar8bit utf8_encoding[4];

Here are some of the utility functions. All functions are inline because they are all defined in the header file.

The GetUTF8Encoding function will encode a 32-bit Unicode value into UTF-8. The size of the encoding will be returned in out_size. This function can reorder the incoming data if its byte order is the opposite of the current system's.

/// This function generates a UTF-8 encoding from a 32 bit UCS-4 character.
/// This is being provided as a static method so it can be used with normal std::string objects
/// default_order is true when the byte order matches the system
inline void GetUTF8Encoding(_char32bit in_char, utf8_encoding &out_encoding, int &out_size, bool default_order = true)
{
 // check the byte order and reorder if necessary
 if(default_order == false)
 {
  in_char = ((in_char & 0x000000ff) << 24) + ((in_char & 0x0000ff00) << 8) + ((in_char & 0x00ff0000) >> 8) + ((in_char & 0xff000000) >> 24);
 }

 if(in_char < 0x80)
 {
  // 1 byte encoding
  out_encoding[0] = (_uchar8bit)in_char;
  out_size = 1;
 }
 else if(in_char < 0x800)
 {
  // 2 byte encoding
  out_encoding[0] = 0xC0 + ((in_char & 0x7C0) >> 6);
  out_encoding[1] = 0x80 + (in_char & 0x3F);
  out_size = 2;
 }
 else if(in_char < 0x10000)
 {
  // 3 byte encoding
  out_encoding[0] = 0xE0 + ((in_char & 0xF000) >> 12);
  out_encoding[1] = 0x80 + ((in_char & 0xFC0) >> 6);
  out_encoding[2] = 0x80 + (in_char & 0x3F);
  out_size = 3;
 }
 else
 {
  // 4 byte encoding
  out_encoding[0] = 0xF0 + ((in_char & 0x1C0000) >> 18);
  out_encoding[1] = 0x80 + ((in_char & 0x3F000) >> 12);
  out_encoding[2] = 0x80 + ((in_char & 0xFC0) >> 6);
  out_encoding[3] = 0x80 + (in_char & 0x3F);
  out_size = 4;
 }
}

inline void GetUTF8Encoding(_char16bit in_char, utf8_encoding &out_encoding, int &out_size, bool default_order = true)
{
 // check the byte order and reorder if necessary
 if(default_order == false)
 {
  in_char = ((in_char & 0x00ff) << 8) + ((in_char & 0xff00) >> 8);
 }

 // to reduce redundant code and possible bugs from typing errors, use the 32-bit version
 GetUTF8Encoding((_char32bit)in_char, out_encoding, out_size, true);
}
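
To check the bit layout by hand, here is a self-contained sketch of the same encoding rules. The function append_utf8 is hypothetical and is not part of utf8utils.h. It writes into a std::vector instead of a utf8_encoding array and uses the standard UTF-8 lead-byte prefixes 0xC0, 0xE0 and 0xF0. It assumes the input is at most 0x10FFFF and does not reject surrogate values.

```cpp
#include <cstdint>
#include <vector>

// Encodes one code point as UTF-8, appends the bytes to out, and returns the
// number of bytes written (1 to 4).
inline int append_utf8(std::uint32_t cp, std::vector<unsigned char> &out)
{
    if (cp < 0x80)
    {
        out.push_back(static_cast<unsigned char>(cp));                        // 0XXX XXXX
        return 1;
    }
    if (cp < 0x800)
    {
        out.push_back(static_cast<unsigned char>(0xC0 | (cp >> 6)));           // 110X XXXX
        out.push_back(static_cast<unsigned char>(0x80 | (cp & 0x3F)));         // 10XX XXXX
        return 2;
    }
    if (cp < 0x10000)
    {
        out.push_back(static_cast<unsigned char>(0xE0 | (cp >> 12)));          // 1110 XXXX
        out.push_back(static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<unsigned char>(0x80 | (cp & 0x3F)));
        return 3;
    }
    out.push_back(static_cast<unsigned char>(0xF0 | (cp >> 18)));              // 1111 0XXX
    out.push_back(static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3F)));
    out.push_back(static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F)));
    out.push_back(static_cast<unsigned char>(0x80 | (cp & 0x3F)));
    return 4;
}
```

For example, U+20AC (the euro sign) comes out as the three bytes E2 82 AC, and U+1F600 comes out as the four bytes F0 9F 98 80.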

The function UTF8CharToUnicode will read the next Unicode character in a string and return its 32-bit Unicode value.

inline _char32bit UTF8CharToUnicode(const _uchar8bit *utf8data)
{
 if(utf8data[0] < 0x80)
 {
  return (_char32bit)utf8data[0];
 }
 else if(utf8data[0] < 0xE0)
 {
  // 2 bytes
  return ((utf8data[0] & 0x1F) << 6) + (utf8data[1] & 0x3F);
 }
 else if (utf8data[0] < 0xF0)
 {
  // 3 bytes
  return ((utf8data[0] & 0xF) << 12) + ((utf8data[1] & 0x3F) << 6) + (utf8data[2] & 0x3F);
 }
 else
 {
  // 4 bytes
  return ((utf8data[0] & 0x7) << 18) + ((utf8data[1] & 0x3F) << 12) + ((utf8data[2] & 0x3F) << 6) + (utf8data[3] & 0x3F);
 }
}

Those two functions do all of the encoding and decoding work. In the UTF-8 code, I often just use simple < operations to determine the size of the encoding. Remember, in 4-byte encodings, the first byte will always be of the form 1111 0XXX. This means that the first byte of a 4-byte encoding will always be greater than or equal to 1111 0000 (F0 hex). In 3-byte encodings, the first byte has the form 1110 XXXX, so it will be between E0 and EF. The same holds true for 2-byte and 1-byte encodings. So instead of doing fancy bit operations, a simple comparison is all that's needed.
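
The comparison trick can be pulled out into a tiny function of its own. This is a sketch: utf8_sequence_length is a hypothetical name, and the function assumes it is given a valid first byte, not a continuation byte in the range 0x80 to 0xBF.

```cpp
#include <cstddef>

// Returns how many bytes the UTF-8 sequence starting with lead occupies,
// using only < comparisons against the first value of each range.
inline std::size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80) return 1;  // 0XXX XXXX
    if (lead < 0xE0) return 2;  // 110X XXXX
    if (lead < 0xF0) return 3;  // 1110 XXXX
    return 4;                   // 1111 0XXX
}
```

The order of the tests matters. Each comparison only has to rule out the next range because the smaller ranges were already handled by the tests before it.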

Here's a list of all the other functions:

This function will increment a pointer to a UTF-8 string to the correct character position. It sets the pointer to point to the null terminator if the position is past the end of the string. Behavior is undefined if string doesn't point to a properly formatted UTF-8 string.

inline void IncrementToPosition(const _uchar8bit *&string, size_t pos);

This function takes a UTF-8 encoded string and returns the position in the buffer where the character at pos begins. Behavior is undefined if string doesn't point to a properly formatted UTF-8 string or if pos is out of range.

inline size_t GetBufferPosition(const _uchar8bit *string, size_t pos);
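
One possible way to map a character index to a byte offset is sketched below. It is not the implementation from utf8utils.h. The name byte_offset_of is hypothetical, and unlike the function described above, this version stops at the null terminator when pos is past the end of the string.

```cpp
#include <cstddef>

// Returns the byte offset of character number pos (0-based) in a
// null-terminated UTF-8 string, or the offset of the terminator if the
// string has fewer than pos + 1 characters.
inline std::size_t byte_offset_of(const unsigned char *s, std::size_t pos)
{
    std::size_t offset = 0;
    while (pos > 0 && s[offset] != 0)
    {
        ++offset;                           // step over the first byte
        while ((s[offset] & 0xC0) == 0x80)  // skip continuation bytes (10XX XXXX)
        {
            ++offset;
        }
        --pos;
    }
    return offset;
}
```

For the bytes 61 C3 A9 E2 82 AC 7A 00, which hold four characters, the offsets of characters 0 through 3 are 0, 1, 3 and 6.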

This function will get the minimum amount of memory needed to encode the string in UTF-8.

template <class T>
inline size_t GetMinimumBufferSize(const T *string);

This template function converts a string into UTF-8 and stores the result in a std::basic_string. The type should be convertible to an int. A template function is being used because the 16-bit and 32-bit implementations are identical except for the type.

template <typename char_type>
inline void MakeUTF8StringImpl(const char_type* instring, std::basic_string<_uchar8bit> &out, bool appendToOut);

This template function converts a string into UTF-8 and stores the result in a buffer. The type should be convertible to an int. A template function is being used because the 16-bit and 32-bit implementations are identical except for the type. Out should point to a buffer large enough to hold the data.

template <typename char_type>
inline void MakeUTF8StringImpl(const char_type* instring, _uchar8bit *out);

This function uses the template function MakeUTF8StringImpl to convert a string.

inline void MakeUTF8String(const _char16bit* instring_UCS2, std::basic_string<_uchar8bit> &out, bool appendToOut = false);

inline void MakeUTF8String(const _char32bit* instring_UCS4, std::basic_string<_uchar8bit> &out, bool appendToOut = false);

These functions increment and decrement a pointer to a UTF-8 encoded string to the next character. utf8data must point to a valid UTF-8 string.

inline void IncToNextCharacter(const _uchar8bit *&utf8data);

inline void DecToNextCharacter(const _uchar8bit *&utf8data);

Gets the length of a UTF-8 string in characters.

inline size_t GetNumCharactersInUTF8String(const _uchar8bit *utf8data);
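
As an illustration of how such a count can be computed, here is a sketch that counts every byte that is not a continuation byte. The name count_code_points is hypothetical. The real function in utf8utils.h may work differently, for example by stepping from one first byte to the next.

```cpp
#include <cstddef>

// Counts the characters (code points) in a null-terminated UTF-8 string.
// Assumes the string is well-formed UTF-8.
inline std::size_t count_code_points(const unsigned char *s)
{
    std::size_t count = 0;
    for (; *s != 0; ++s)
    {
        // Continuation bytes look like 10XX XXXX. Every other byte starts
        // a new character.
        if ((*s & 0xC0) != 0x80)
        {
            ++count;
        }
    }
    return count;
}
```

The bytes 61 C3 A9 E2 82 AC 7A 00 take up seven bytes before the terminator but hold only four characters.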

Full source code can be found here: utf8utils.0.h

Thanks for reading. In my next post, I'll put together the complete utf8string class.

-------------
For the other posts in this series:

Writing a STL-Style UTF-8 String Class Part 1
Writing a STL-Style UTF-8 String Class Part 2
Writing a STL-Style UTF-8 String Class Part 3
Writing a STL-style UTF-8 String Class Part 4

Writing a STL-style UTF-8 String Class Part 5

Sunday, December 8, 2013

Writing a STL-Style UTF-8 string class Part 2

Update: The latest code can now be found on GitHub. Check the post "STL-Style UTF-8 String Class now on GitHub" for more information.
____________________________________________________________________________

Over the next few days I will convert the basic string class that I made in my last post into a complete std::string-like UTF-8 string class. Before I add the real UTF-8 stuff, I want to add more STL-style things to the class, such as iterators and some standard types that I'll need.

First I'll add the following types

#ifndef WCHAR32BITS

// because signed or unsigned is not mandated for char or wchar_t in the standard,
// always use the char and wchar_t types or we may get compiler errors when using
// some standard functions

typedef char _char8bit; // for ASCII and UTF8
typedef unsigned char _uchar8bit; // for ASCII and UTF8
typedef wchar_t _char16bit; // for UCS-2
typedef std::uint32_t _char32bit; // for UTF32

#else

typedef char _char8bit; // for ASCII and UTF8
typedef unsigned char _uchar8bit; // for ASCII and UTF8
typedef std::uint16_t _char16bit; // for UCS-2
typedef wchar_t _char32bit; // for UTF32

#endif

The final UTF-8 string class should be able to process strings in different formats. These definitions will help keep those types organized without worrying about the compiler implementations. I'm also explicitly defining an unsigned 8-bit char type. This is because the sign of the char data type is not specified in the standard and can be signed or unsigned. For UTF-8 to work properly, it needs to be an unsigned 8-bit integer. The _char16bit and _char32bit types are being defined in terms of wchar_t to prevent compiler errors with std::wstring and so it'll work with the standard wide character string functions, such as wcscpy() and wcscmp().
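
As an aside, the same choice could be made by the compiler instead of a WCHAR32BITS macro. The sketch below is an alternative, not what the class actually does. The names wchar_is_16bit, sketch_char16 and sketch_char32 are hypothetical, and it assumes a C++11 compiler, 8-bit bytes, and a wchar_t that is either 2 or 4 bytes wide.

```cpp
#include <cstdint>
#include <type_traits>

// True on platforms where wchar_t is 2 bytes wide.
const bool wchar_is_16bit = (sizeof(wchar_t) == 2);

// Use wchar_t for whichever width it matches, and a fixed-width integer type
// for the other one.
typedef std::conditional<wchar_is_16bit, wchar_t, std::uint16_t>::type sketch_char16;
typedef std::conditional<wchar_is_16bit, std::uint32_t, wchar_t>::type sketch_char32;

// Fail at compile time on platforms where the assumptions do not hold.
static_assert(sizeof(sketch_char16) == 2, "expected a 2-byte 16-bit character type");
static_assert(sizeof(sketch_char32) == 4, "expected a 4-byte 32-bit character type");
```

Exactly one of the two aliases is wchar_t on any given platform, which keeps that type interchangeable with std::wstring and the wide character functions.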

Next these definitions are required by STL so they need to be added to our class. The types are fairly self-explanatory. For now the class will use _uchar8bit, but once I start adding the UTF-8 capabilities, I'll change the class to return _char32bit.

typedef _uchar8bit           value_type;
typedef _uchar8bit          *pointer;
typedef const _uchar8bit    *const_pointer;
typedef _uchar8bit          &reference;
typedef const _uchar8bit    &const_reference;
typedef size_t               size_type;
typedef ptrdiff_t            difference_type;

The class also needs an iterator. To build this iterator, I'll define a class using std::iterator as a base. The iterator will also use the std::bidirectional_iterator_tag. This means I'll need to define increment (++) and decrement (--) operators. I'll also make the iterator a template class so I won't have to write the code twice for the const version.

template <class T>
class utf8string_iterator : public std::iterator<std::bidirectional_iterator_tag, T>
{
 private:
  pointer buf_;

  void inc()
  {
   // increments the iterator by one
   // result in undefined behavior (crashes) if already at the end 
   ++buf_;
  }

  void dec()
  {
   // decrements the iterator by one
   // result in undefined behavior (crashes) if already at the beginning
   --buf_;
  }

 public:
  // b should be a null terminated string in UTF-8
  // if this is the end start_pos should be the index of the null terminator
  // start_pos should be the valid start of a character
  utf8string_iterator(pointer b, size_type start_pos)
  {
   buf_ = &b[start_pos];
  }

  // b should already point to the correct position in the string
  utf8string_iterator(pointer b)
   :buf_(b)
  {
  }

  reference operator*() const
  {
   return *buf_;
  }

  // does not check to see if it goes past the end
  // iterating past the end is undefined
  utf8string_iterator &operator++()
  {
   inc();

   return *this;
  }

  // does not check to see if it goes past the end
  // iterating past the end is undefined
  utf8string_iterator operator++(int)
  {
   utf8string_iterator copy(*this);

   // increment
   inc();

   return copy;
  }

  // does not check to see if it goes past the end
  // iterating past begin is undefined
  utf8string_iterator &operator--()
  {
   dec();
   return *this;
  }

  // does not check to see if it goes past the end
  // iterating past begin is undefined
  utf8string_iterator operator--(int)
  {
   utf8string_iterator copy(*this);

   dec();

   return copy;
  }

  bool operator == ( const utf8string_iterator &other) const
  {
   // just compare pointers
   // the programmer is responsible for making sure both iterators are for the same set of data
   return buf_ == other.buf_;
  }
 
  bool operator != (const utf8string_iterator &other) const
  {
   // just compare pointers
   // the programmer is responsible for making sure both iterators are for the same set of data
   return buf_ != other.buf_;
  }

  bool operator < ( const utf8string_iterator &other) const
  {
   // just compare pointers
   // the programmer is responsible for making sure both iterators are for the same set of data
   return buf_ < other.buf_;
  }
 
  bool operator > (const utf8string_iterator &other) const
  {
   // just compare pointers
   // the programmer is responsible for making sure both iterators are for the same set of data
   return buf_ > other.buf_;
  }
  
  bool operator <= ( const utf8string_iterator &other) const
  {
   // just compare pointers
   // the programmer is responsible for making sure both iterators are for the same set of data
   return buf_ <= other.buf_;
  }
 
  bool operator >= (const utf8string_iterator &other) const
  {
   // just compare pointers
   // the programmer is responsible for making sure both iterators are for the same set of data
   return buf_ >= other.buf_;
  }
};
// End ---------------------

Now these types need to be added to the class definition. For now, I will not need to write a reverse iterator. There is an STL class that I can use to do that.

// make our iterator types here
typedef utf8string_iterator<value_type>   iterator;
typedef utf8string_iterator<const value_type> const_iterator;
typedef std::reverse_iterator<iterator>   reverse_iterator;
typedef std::reverse_iterator<const_iterator> const_reverse_iterator;
// End ---------------------

After adding a few more functions and reorganizing the class, I now have a class that's more than 50% like std::string, but it's missing the most important thing: it needs to process UTF-8. In my next post, I'll write all of the necessary UTF-8 functions.

Be sure to check out my previous post Writing a STL-Style UTF-8 string class Part 1 and my video presentation that gives a good introduction to UTF-8.

For the full source code: mystring.1.h

-------------
For the other posts in this series:

Writing a STL-Style UTF-8 String Class Part 1
Writing a STL-Style UTF-8 String Class Part 2
Writing a STL-Style UTF-8 String Class Part 3
Writing a STL-style UTF-8 String Class Part 4

Writing a STL-style UTF-8 String Class Part 5

Writing a STL-Style UTF-8 string class Part 1

Update: The latest code can now be found on GitHub. Check the post "STL-Style UTF-8 String Class now on GitHub" for more information.
____________________________________________________________________________

I've really been into UTF-8 these days. I've made a video introduction for it on YouTube and I've written an article. Now I'm writing a std::string-style UTF-8 string class that I hope will be useful to me and to others.

This will be a 5-part series that I hope to complete this week. The final installment will be a full article about the class that I'll also publish on Gamedev.net.

I will set out to create a utf8string class that behaves as much as possible like std::string. It will have all of the same methods as std::string and will overload the cast operator so it can be cast to std::string and std::wstring. The class will also support C++ STL-style iterators. The iterators will be constant only, though, because UTF-8 is a variable-width encoding, so it wouldn't be possible to supply a mutable reference to any character. Because of this, the iterator will always return an unsigned 32-bit int type, or wchar_t on systems that use a 32-bit wchar_t type. I'll implement everything from scratch at first so you'll be able to see everything that's going on, but after that, I'll use more from the STL (Standard Template Library) to make it better.
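
To see why a mutable reference to a single character can't work, consider what happens when one character is swapped for another with a different encoded size. The sketch below uses a plain std::string holding UTF-8 bytes. The helper name replace_encoded_char is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Replaces the old_len bytes at byte_pos with new_bytes and returns how many
// bytes the whole string grew by (negative if it shrank).
inline long replace_encoded_char(std::string &utf8, std::size_t byte_pos,
                                 std::size_t old_len, const std::string &new_bytes)
{
    const long before = static_cast<long>(utf8.size());
    utf8.replace(byte_pos, old_len, new_bytes);
    return static_cast<long>(utf8.size()) - before;
}
```

Replacing the one-byte 'e' in "cafe!" with the two-byte encoding C3 A9 grows the buffer by one byte and shifts the '!' along. An assignment through a plain reference to one array element could never do that, which is why the iterators only hand out values.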

To begin, let's start with the shell of the basic class. Over the next few days, I'll slowly transform this into a utf8string class that behaves as much like std::string as possible.

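#include <cstddef>   // size_t
#include <cstring>   // C string functions
#include <ostream>   // std::ostream
#include <stdexcept> // std::out_of_range

class mystring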
{
 private:
  // this implementation will use a null-terminated string
  unsigned char * buffer;
  // this is the size of the buffer, not the string
  size_t   reserve_size;
  // keep the size of the string so we don't have to count it every time
  size_t   used_length;

  // this will resize the buffer, but it will not shrink the buffer if new_size < reserve_size
  void growbuffer(size_t new_size, bool copy_data = true)
  {
   if(new_size > reserve_size)
   {
    unsigned char *new_buffer = new unsigned char[new_size];

    if(used_length && copy_data)
    {
     // copy the buffer
     memcpy(new_buffer, buffer, used_length);
     new_buffer[used_length] = 0; // ensure null terminator
    }
    else
    {
     // nothing was copied, so the new buffer holds an empty string
     new_buffer[0] = 0;
     used_length = 0;
    }

    delete [] buffer;
    buffer = new_buffer;
    reserve_size = new_size;
   }
  }

  size_t recommendreservesize(size_t str_len)
  {
   return (str_len + 1) * 2;
  }

  // str_len is the number of 8-bit bytes before the null terminator
  void copystringtobuffer(const unsigned char *str, size_t str_len)
  {
   if(str_len >= reserve_size)
   {
    growbuffer(recommendreservesize(str_len), false);
   }
   memcpy(buffer, str, str_len);

   used_length = str_len;

   // set the null terminator
   buffer[used_length] = 0;
  }

 public:
  // default constructor
  mystring()
  :buffer(0L), reserve_size(0), used_length(0)
  {
   growbuffer(32, false); // set the string to an initial size
  }

  // build from a c string
  // undefined (ie crashes) if str is NULL
  mystring(const char *str)
  :buffer(0L), reserve_size(0), used_length(0)
  {
   copystringtobuffer((const unsigned char *)str, strlen(str));
  }

  // build from a c string
  // undefined (ie crashes) if str is NULL
  mystring(const unsigned char *str)
  :buffer(0L), reserve_size(0), used_length(0)
  {
   copystringtobuffer(str, strlen((const char *)str));
  }

  // construct from an unsigned char
  mystring(const unsigned char c)
  :buffer(0L), reserve_size(0), used_length(0)
  {
   // set the string to an initial size
   growbuffer(32, false);
   buffer[0] = c;
   buffer[1] = 0;
   used_length = 1;
  }

  // construct from a normal char
  mystring(const char c)
  :buffer(0L), reserve_size(0), used_length(0)
  {
   // set the string to an initial size
   growbuffer(32, false);
   buffer[0] = (unsigned char)c;
   buffer[1] = 0;
   used_length = 1;
  }

  // copy constructor
  mystring(const mystring &str)
  :buffer(0L), reserve_size(0), used_length(0)
  {
   copystringtobuffer(str.buffer, str.used_length);
  }

  // destructor
  ~mystring()
  {
   delete [] buffer;
  }

  // cast to a c-string
  const unsigned char *c_str() const
  {
   return buffer;
  }

  // assignment operator
  // we can define assignment operators for all possible types such as char, const char *, etc,
  // but this is not necessary. Because those constructors were provided, the compiler will be
  // able to build a mystring for those types and then call this overloaded operator.
  // if performance becomes an issue, the additional variations to this operator can be created
  mystring& operator= (const mystring &rvalue)
  {
   // nothing to copy when a string is assigned to itself
   if(this != &rvalue)
   {
    copystringtobuffer(rvalue.buffer, rvalue.used_length);
   }

   return *this;
  }

  // move assignment operator
  // should move the data to this object and remove it from the old one
  mystring& operator= (mystring &&rvalue)
  {
   if(this != &rvalue)
   {
    // release our own buffer first so it doesn't leak
    delete [] buffer;

    buffer   = rvalue.buffer;
    reserve_size = rvalue.reserve_size;
    used_length  = rvalue.used_length;

    // clear the values in the other string
    // the moved-from string has no buffer until it is assigned a new value
    rvalue.buffer   = 0L;
    rvalue.reserve_size  = 0;
    rvalue.used_length  = 0;
   }

   return *this;
  }

  // request a new buffer size
  // this will resize the buffer, but it will not shrink the buffer if new_size < reserve_size
  // not useful unless the actual size of the string is known
  void reserve(size_t new_size)
  {
   growbuffer(new_size, true);
  }

  // appends a string to the end of this one
  // we can define this operator for all possible types such as char, const char *, etc,
  // but this is not necessary. Because those constructors were provided, the compiler will be
  // able to build a mystring for those types and then call this overloaded operator.
  // if performance becomes an issue, the additional variations to this operator can be created
  mystring& operator+= (const mystring& str)
  {
   // read the length first in case str is this same string (s += s)
   size_t append_length = str.used_length;
   size_t total_length  = used_length + append_length;
   // use >= because the buffer also needs room for the null terminator
   if(total_length >= reserve_size)
   {
    // resize the buffer
    reserve(recommendreservesize(total_length));
   }
   if(append_length)
   {
    // copy by length instead of using strcat, which must not be given overlapping strings
    memcpy(buffer + used_length, str.buffer, append_length);
   }
   used_length = total_length;

   // set the null terminator
   buffer[used_length] = 0;

   return *this;
  }

  // returns a reference to the character at the index
  // doesn't throw exception. undefined if out of range
  unsigned char &operator[](size_t pos)
  {
   return buffer[pos];
  }

  // returns a const reference to the character at the index
  // doesn't throw exception. undefined if out of range
  const unsigned char &operator[](size_t pos) const
  {
   return buffer[pos];
  }

  // returns a reference to the character at the index
  // will throw an exception if out of range
  unsigned char &at(size_t pos)
  {
   // check range
   if(pos >= used_length)
   {
    throw std::out_of_range("subscript out of range");
   }
   // use operator defined above, will help us later
   return (*this)[pos];
  }

  // returns a const reference to the character at the index
  // will throw an exception if out of range
  const unsigned char &at(size_t pos) const
  {
   // check range
   if(pos >= used_length)
   {
    throw std::out_of_range("subscript out of range");
   }
   // use operator defined above, will help us later
   return (*this)[pos];
  }

  // overload stream insertion so we can write to streams
  friend std::ostream& operator<<(std::ostream& os, const mystring& string)
  {
   os << string.c_str();

   return os;
  }

  // concatenates two strings and returns the result as a new string
  // we can define this operator for all possible types such as char, const char *, etc,
  // but this is not necessary. Because those constructors were provided, the compiler will be
  // able to build a mystring for those types and then call this overloaded operator.
  // if performance becomes an issue, the additional variations to this operator can be created
  friend mystring operator + (const mystring& lhs, const mystring& rhs)
  {
   mystring out(lhs);
   out += rhs;

   return out;
  }
};

That's a very simple string class. It can do many of the things that std::string can do but not all of them. Next I'll add some more things and give it more of an std::string feel.
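The comments in the class mention that the converting constructors let the compiler build a temporary string for other types and then call the single overloaded operator. Here is a small, self-contained sketch of that idea. The tinystring type and the demo_concat function are hypothetical stand-ins made up for this illustration, and tinystring stores its data in a std::string only to keep the example short.

```cpp
#include <string>

// Hypothetical stand-in for mystring, reduced to the parts needed to
// show the implicit conversions. Not the class from this post.
struct tinystring
{
    std::string data;

    tinystring() {}
    tinystring(const char *str) : data(str) {} // converting constructor
    tinystring(char c) : data(1, c) {}         // converting constructor

    // the only += overload; other types reach it through the constructors
    tinystring& operator+=(const tinystring &rhs)
    {
        data += rhs.data;
        return *this;
    }

    // non-member friend, so the left-hand side can be converted too
    friend tinystring operator+(const tinystring &lhs, const tinystring &rhs)
    {
        tinystring out(lhs);
        out += rhs;
        return out;
    }
};

inline std::string demo_concat()
{
    tinystring s("abc");
    s += "def";             // const char * becomes a temporary tinystring
    s += '!';               // char becomes a temporary tinystring
    tinystring t = "x" + s; // the left operand is converted as well
    return t.data;          // "xabcdef!"
}
```

The cost of this convenience is one temporary object per call, which is the performance trade-off the comments point out.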

For the full source code of this entry: UTF8 String Draft Header

I really want to thank Pete Goodliffe for the inspiration. I've been meaning to make a UTF-8 string class for a long time now, but his article, STL-style Circular Buffers By Example, made me want to go the extra mile with this.

-------------
For the other posts in this series:

Writing a STL-Style UTF-8 String Class Part 1
Writing a STL-Style UTF-8 String Class Part 2
Writing a STL-Style UTF-8 String Class Part 3
Writing a STL-style UTF-8 String Class Part 4
Writing a STL-style UTF-8 String Class Part 5
