Converting A Byte Array Into A Bit Stream Using ColdFusion And BigInteger
The other day, I was faced with a situation where I had to split a set of bytes up into smaller sets of bits (smaller than the 8 bits that go into a byte). I was definitely stumped. Not only have I never done this before but, my understanding of bits (especially signedbits) is rather weak. After some trial an error, however, I was able to leverage Java's BigInteger class to help me turn a byte array into a bit stream that can read an arbitrary number of bits at a time.
So, again, before I show the code, let me once again caveat that my understanding of the lowlevel binary representation of numbers is not super strong. I can get it to work  but that doesn't mean that it will handle outlier cases in an appropriate manner.
That said, I created a ColdFusion component  BitStream.cfc  that can take a string or a byte array and present two methods:
 read( size )
 readBigIngeter( size )
Both of them do the same thing  read the given number of bits into a numeric result. But, the first one returns a primitive integer whereas the second one returns a Java BigInteger. The reason for the BigInteger version is that the number of bits being read may not fit into a normal 32bit integer (or is that 31 bits, less the signed bit  that's where I really start to get confused?).
Internally, you will see that they both use readBitInteger(); but, the read() version will simply return the .intValue() of the BigInteger. In either case, the stream will return 1 (or the BigInteger representation of 1) when there are no more bits to read.
Here is the BitStream.cfc ColdFusion component:
 <cfscript>
 // NOTE: CFScript tags used for Gist colorcoding. They can be removed.
 component
 output = false
 hint = "I provide a bitbased stream wrapper to an underlying byte array."
 {

 /**
 * I take either a binary value (byte array) or a string and provide readaccess to
 * the underlying bits. If a string is passedin, it is assumed to be UTF8 encoded.
 *
 * @output false
 */
 public any function init( required any input ) {

 // If the input is a byte array, we can use it asis.
 if ( isBinary( input ) ) {

 var byteArray = input;

 // If the input is a string, we can decode the bytes.
 } else if ( isSimpleValue( input ) ) {

 var byteArray = charsetDecode( input, "utf8" );

 // If input is not supported for conversion.
 } else {

 throw(
 type = "InvalidInput",
 message = "Input must be one of the following: ByteArray, String, Integer"
 );

 }

 // We're going to use the BigInteger to access the individual bits.
 source = createObject( "java", "java.math.BigInteger" ).init( byteArray );

 // If the leading bit(s) are zeros, they are not included in the bit length of
 // the BitInteger, which will mess up our read. But, since this is all coming
 // from a byte array, we can determine the bit length using the number of bytes.
 bitLength = ( arrayLen( byteArray ) * 8 );

 // Since the bits in the BigInteger are zerobased, we need to start at n1.
 bitIndex = ( bitLength  1 );

 // Each read action will start with this result and then start shifting and
 // settings bits (which will return a new BigInteger instance).
 resultTemplate = createObject( "java", "java.math.BigInteger" )
 .init( javaCast( "string", "0" ) )
 ;

 // When there are no more bits to read in the underlying source, we are going
 // to return (1) to indicate the end of the stream.
 nonResultTemplate = createObject( "java", "java.math.BigInteger" )
 .init( javaCast( "string", "1" ) )
 ;

 }


 /**
 * I read the given number of bits from the stream and return them as the rightmost
 * bits of an integer. If there are not enough bits available for a fullread, the
 * result will be rightpadded with zeros. When there are no more bits to read, a 1
 * is returned.
 *
 * @output false
 */
 public numeric function read( required numeric size ) {

 return( readBigInteger( size ).intValue() );

 }


 /**
 * I read the given number of bits from the stream and return them as the rightmost
 * bits of a BigInteger. If there are not enough bits available for a fullread, the
 * result will be rightpadded with zeros. When there are no more bits to read, a 1
 * is returned (in the form of a BigInteger).
 *
 * @output false
 */
 public any function readBigInteger( required numeric size ) {

 // If we've gone past the bounds of BigInteger, return 1. There are no more
 // bits to read from the underlying source.
 if ( bitIndex < 0 ) {

 return( nonResultTemplate );

 }

 // Start out result as zero.
 var result = resultTemplate;

 // When read read from the stream, we'll have to see if we miss any bits, which
 // will have to be rightpadded with zeros.
 var missingBits = size;

 // Keep reading a bit at a time until we fulfill the size requirement or, until
 // we run out of bits to read.
 while ( missingBits && ( bitIndex >= 0 ) ) {

 // Will add a zero as the rightmost bit.
 result = result.shiftLeft( javaCast( "int", 1 ) );

 // If the source bit is set (ie, nonzero), then toggle the rightmost result
 // bit (the one we just leftshifted into the result).
 if ( source.testBit( javaCast( "int", bitIndex ) ) ) {

 result = result.setBit( javaCast( "int", 0 ) );

 }

 // Update our whileconditions.
 missingBits;
 bitIndex;

 }

 // If we ran out of bits, midread, then we have to rightpad with zeros.
 while ( missingBits ) {

 result = result.shiftLeft( javaCast( "int", 1 ) );

 }

 return( result );

 }

 }
 </cfscript>
To test this, I will convert a simple string  "abc"  into a set of bits and then compare that set to the bits that come out of the stream:
 <cfscript>

 // Each of the characters in this string is represented by a byte, which is composed
 // of 8 bits. We're going to wrap this in a BitStream and then read chunks of the
 // bits out, 5 at a time.
 input = "abc";

 // Create the bit stream.
 stream = new BitStream( input );


 // To demonstrate that the bit stream works, let's output the bits of the input
 // string. This way, we can see how our 5bit reads line up with the bits of the
 // entire stream.
 for ( character in listToArray( input, "" ) ) {

 bitValue = formatBaseN( asc( character ), 2 );

 // NOTE: We are using numberFormat so we don't lose leading zeros.
 writeOutput( numberFormat( bitValue, "00000000" ) );

 }

 writeOutput( "<br />" );


 results = [];

 // Read first 5 bits.
 result = stream.read( 5 );

 // When we run out of bits, the stream will return 1.
 while ( result != 1 ) {

 // Get the underlying bitstring in our read chunk.
 bitValue = formatBaseN( result, 2 );

 // NOTE: We are using numberFormat so we don't lose leading zeros.
 arrayAppend( results, numberFormat( bitValue, "00000" ) );

 // Read the next chunk of bits.
 result = stream.read( 5 );

 }

 // Now that we've read all the chunks of 5 bits, output with and without a space
 // delimiter. This will help us see if the resultant bits line up with the input
 // stream AND also help us see where the chunks were bounded.
 writeOutput( arrayToList( results, "" ) );
 writeOutput( "<br />" );
 writeOutput( arrayToList( results, " " ) );

 </cfscript>
In this code, I'm using numberFormat() to make sure that I don't accidentally lose leading zeros in the output. This way, it will be easy to see where things line up. And, when we run the above code, we get the following page output:
011000010110001001100011
0110000101100010011000110
01100 00101 10001 00110 00110
As you can see, the first line (input bits) and the second line (stream bits) match up, except for the extra zero on the stream. The extra zero is a rightpadded readresult. In order to make sure that N bits can always be read (assuming some bits are available), we need to rightpad with zeros. And since 5 (our read size) doesn't evenly divide into 24 (our total number of input bits), we need to make up the difference with zeros.
The third line contains the same data as the second line, but with a space between each read result. This way, we can easily see how the input bits map onto the stream bits.
Right now, this was just a stepping stone to a bigger concept that deals with baseencoding. But, more on that to come.
Reader Comments
When you say "signedbit" that's confusing me  generally I know the bit that signs a byte as a "sign bit" or "signing bit", and then the byte itself can be a "signed byte"
I'm not trying to be pedantic  but can you actually "sign" a bit  isn't it just 0 or 1, as opposed to 1 or 1?
@Pete,
You are correct  I was not speaking accurately. You cannot sign an individual bit  they are always 1 or 0. I was referring to the bit that indicates the sign of a set of bytes (ie, the leftmost bit).
Here's a bit about signed integers (pun intended).
An Nbit unsigned integer can only store positive values from zero to 2^N1. So an 8bit unsigned integer can store values from 0 to 255 decimal. In most cases the rightmost bit is the least significant and the leftmost is most significant, just like decimal numbers, except that in binary we're dealing with powers of 2 instead of 10. In this case, the rightmost bit represents 2^0, followed by 2^1, 2^2, etc. up to 2^7 for the leftmost bit.
0000 0010 = 2
0001 0000 = 16
1000 0000 = 128
1010 1010 = 128 + 32 + 8 + 2 = 170
1111 1111 = 128 + 64 + 32 + 16 + 8 + 4 + 2 + 1 = 255
For signed integers, the leftmost bit (the most significant bit) is flipped to represent the negative of the amount it would represent in an unsigned integer. This "shifts" the range available to have zero in the middle of the range (well, almost). For an 8bit signed integer, the range is 128 to 127. You calculate the represented value in the same way, but remember the MSB is negative.
0000 0010 = 2
0000 1111 = 15
0111 1111 = 64 + 32 + 16 + 8 + 4 + 2 + 1 = 127
1000 0000 = 128
1010 1010 = 128 + 32 + 8 + 2 = 86
1111 1111 = 128 + 64 + 32 + 16 + 8 + 4 + 2 + 1 = 1
So for positive integers up to 127, the bits are the same as those used for unsigned integers. But, when you flip the most significant bit to 1, the result goes all the way to the negative extreme of the range and counts back up. With signed integers, the value 1 is always represented by all 1's.
Any number with a 1 for the most significant bit will be negative. That's why it's sometimes referred to as the "sign bit."
@Darren,
Very interesting. I've never really looked at the sign stuff, but that was definitely different than the mental model that I had made up. I had sort of just assumed that the MSB was the negative sign; BUT, that nothing else changed. I pictured it like base10, where the only difference between 456 and 456 is the presence of the negative sign. I didn't realize that the logic of the set bits also changes.
That's super interesting. So, the bits always "add". It just depends on whether they adding to "0" or to "128".
Am I correct in assuming that there is no way to look at a byte and tell whether or not it is signed? Right? It seems that can only be known from the context, not from the data. So, for example, if a Database has a schema that defines an unsigned INT, the only thing that changes is the schema definition and not the actual data.... if that question makes sense.
@All,
Also, as a followup, I was playing with this in an attempt to figure out how to implement Base32encoding:
www.bennadel.com/blog/2687usingbase32encodinganddecodingincoldfusion.htm
In Base32, you have to split up the byte "stream" into chunks of 5bit bytes which get encoded using a Base32 alphabet.
@Ben,
Yep, if you just have a set of bits in front of you and nothing else, there's no way to know what they represent without applying some sort of data type. An arbitrary set of 32 bits could be a 32bit signed or unsigned integer, could be a floating point number, could be two 16bit or four 8bit integers, could be anywhere from 14 UTF8 characters. It's all in how you interpret them. :)
The technical term for representing integers this way is "two's complement," and the main reason it's used is for consistency in arithmetic operations. With two's complement, the lowlevel addition, subtraction and multiplication operations remain the same whether the value is signed or unsigned, which does not hold true for the signmagnitude system (using the MSB as a negate "flag"). The reasons behind that get a little complicated, but if you're interested in a math adventure, just Google it!
Also, fascinating stuff with your Base32 component. Thanks for sharing that. Coldfusion does not make bit splitting simple, but it's nice to know you can get it done!
@Darren,
It is nice that the math stays the same (less the sign bit). It's interesting to think about the implementation at this level  much different than I'm used to thinking. For the most part, I'm just happy when I can get UTF8 characters to show up properly on a website :D
As far as bitwise stuff in ColdFusion, I'm still learning. I've the bitAnd/bitOr functions before. But, there are even a number of bit functions that I don't understand (the ColdFusion documentation is sometimes very lacking). But, I might go down that rabbit hole for a bit.
That's the crazy thing about programming  every hurdle is actually a rabbit hole :
@Darren,
I have to say, your explanation really saved my bacon!
www.bennadel.com/blog/2689creatingsignedjavabytevaluesusingcoldfusionnumbers.htm
I was totally stumped on how to create a Java byte, which is a signed value, using a ColdFusion number, which is 32bits. If you hadn't told me how twoscompliment works, I would have been forever lost! Much thanks!
@Ben,
I'm so glad I could help! I rejoice any time bacon is saved. Mmmm...
Darren
@Darren,
Ha ha, mmmm, bacon :)