
What are bytes

Started July 30, 2017 03:36 PM
12 comments, last by Satharis

Hi guys,

 

So I feel like I'm going to have to mess around with Java functions for a while, because this is confusing me.

I have a Java server and a GML client. The client sends information to Java via buffers, which support a few data types: string, u8, u16, their signed counterparts, long and so on. I could work with all of that because I knew to use the relevant data type - if I'm working with strings I'll use the buffer_string type, if I'm working with small values (up to 255, I think) I'd use buffer_u8, and so on, keeping the buffer size large enough for its purpose and not much larger. Now that I have a Java server, a few things confuse me - things that happened behind the scenes with GameMaker Studio. For one, there was some overhead going on: GML has an option that lets you choose between the GML overhead and a raw mode that removes it. The overhead was handshaking or something like that, and I believe other things also play a role, like sending buffer sizes. But I might be wrong.

What I need to do is see exactly what is being sent from the GML client, and I don't know how to do that. What I've tried is sending a u8 value (0-255, I believe) from the GML client to the Java server, receiving it as the type "short"; I sent a single value of 5. What I don't understand is that there has to be more to it - I'm not receiving this buffer size that GML is supposedly sending. Or, if I am, how do I see it?
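(As a point of reference, a minimal sketch of reading that single u8 on the Java side could look like the following. It assumes a plain TCP connection with no extra header bytes in front of the value; the port number and class name are made up for illustration.)

import java.io.DataInputStream;
import java.net.ServerSocket;
import java.net.Socket;

// Minimal sketch: read one unsigned byte (a GML buffer_u8 value) from a client.
// Assumes a raw TCP connection with no header bytes in front of the value.
public class ReadU8Sketch {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(5000);      // port chosen arbitrarily
             Socket client = server.accept();
             DataInputStream in = new DataInputStream(client.getInputStream())) {

            // readUnsignedByte() returns 0-255 as an int, which matches buffer_u8.
            // readByte() would give -128..127 instead, because Java's byte type is signed.
            int value = in.readUnsignedByte();
            System.out.println("Received u8: " + value);        // prints 5 if the client sent 5
        }
    }
}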

Which brings me to arrays of bytes. I know what an array is, but I don't really know what bytes are, and yet I work with them. I know they're larger than bits and smaller than kilobytes, a way of measuring the size of something, but what exactly?

When I am streaming data back and forth, regardless of data type, am I simply sending bytes back and forth? The reason I ask is that I don't have a good understanding of how to set up the code on the server, and possibly on the client, to receive the information properly and in order.

Originally, in GML, the client would for example send a string/integer/etc. to the server, and it would tell the server the size of the data and the data type used, so that the server knows a string is being received and how big it is. Honestly, I never asked why the size mattered - couldn't it just read until a line break or something? I'm sure the devs of the language had their reasons; I'm primarily concerned with what's relevant to my problem. Anyway, the server would read it, and that was all there was to it for me. Now with Java I believe there's much more to it; in Java I don't see functions like...

 

output = new DataOutputStream(my_Sock.getOutputStream());

output.writeString(...)

output.writeU8(...)

output.writeU16(...)

 

It doesn't seem to be as simple as that. I do see that I've got writeInt and writeChars, which seem like what I would think about using, but pressing F2 on writeInt brings this up:

  • writeInt

    
    public final void writeInt(int v)
                        throws IOException
    Writes an int to the underlying output stream as four bytes, high byte first. If no exception is thrown [...]

I've no idea what writing something as four bytes means, and high byte first? What's a high byte? Whenever the docs talk about bytes I expect that data to actually be sent/received, so I'm expecting four bytes of data to go out if I use output.writeInt(...); but, pretending I know what bytes are, what if I pass writeInt a value larger than four bytes?
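(A small sketch may help here; this is an illustration rather than anything from the docs, using a ByteArrayOutputStream just so the bytes can be printed. It shows that writeInt always produces exactly four bytes, with the high byte leading. You also can't pass writeInt a value "larger than four bytes": its parameter is an int, which is exactly 32 bits; a bigger number would need writeLong, which writes eight bytes.)

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch: what DataOutputStream.writeInt(5) actually produces.
public class WriteIntSketch {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);

        out.writeInt(5);    // a Java int is always 32 bits = 4 bytes

        // High byte first ("big-endian"): the most significant byte goes out first,
        // so the value 5 becomes the byte sequence 0, 0, 0, 5.
        for (byte b : bytes.toByteArray()) {
            System.out.print((b & 0xFF) + " ");   // prints: 0 0 0 5
        }
    }
}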

 

As you can see there's a lot of confusion here, but to the trained eye someone might be able to suggest a course of action, like which manual or reference I could look at to get a better understanding of what's going on.

While I leave this here and hope someone picks it up, I'll keep reading. Thanks for reading, hope to get some replies. :)

 

 

Bytes are well documented. Do a Google/Wikipedia search for definitions before you ask, so that your questions can be more specific. :)

A byte is an amount of storage: 8 bits of data (at least in every situation you will ever encounter). It is therefore broadly equivalent to the 'u8' type you're working with.

All data transmitted or stored is essentially sending bytes back and forth. The computer programs at each end then decide how to interpret those bytes - for example, 1 byte on its own might be a 'u8', 4 bytes in a row might be considered an 'int', 8 bytes might be a 64 bit pointer, and so on.

To understand what is meant by "high byte first", you need to understand how numbers are encoded in binary. (Read this paragraph, then read this - https://en.wikipedia.org/wiki/Binary_number#Counting_in_binary) When we write a number in decimal, we write the 'high digit first', so 145 is one hundred, four tens, and five ones. A hundred is the 'high' amount, and the ones are the low amount. Binary doesn't use digits from 0 to 9 like decimal does - it uses digits from 0 to 1. (Binary digits are "bits".) This means that instead of each column signifying ones, tens, hundreds, thousands, etc, each column signifies ones, twos, fours, eights, sixteens, etc. Each byte has 8 'columns' of binary digits, i.e. 8 bits, and when we send the 'high byte' first, we're sending the byte with the largest column values first in the list.
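(To make the column idea concrete, here's a tiny sketch using Java's built-in base conversions; the specific numbers are just examples.)

// Sketch: decimal vs binary column values.
public class BinaryColumns {
    public static void main(String[] args) {
        // 145 in decimal is 1*100 + 4*10 + 5*1.
        // 145 in binary is 10010001, i.e. 128 + 16 + 1.
        System.out.println(Integer.toBinaryString(145));      // prints 10010001
        System.out.println(Integer.parseInt("10010001", 2));  // prints 145

        System.out.println(Integer.toBinaryString(5));        // prints 101: one four, no twos, one one
    }
}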

Because a byte is just 8 bits, it tells us nothing about what sort of information it is meant to carry. This is why 2 networked programs need to know what to expect - so if you're sending an int, you need to know to expect 4 bytes and what order they're going to arrive in. If you want to send a string, you need to know what format it takes - some systems send an int to mark the length followed by each byte in the text, and some systems send each byte in the text and then a delimiter (i.e. what you call "a line break or something"). Both ways exist and are used. And a 'short' is usually a 16-bit value of 2 bytes. Scan this list - https://en.wikipedia.org/wiki/C_data_types - but be aware that type names can vary from language to language, and sometimes from platform to platform. e.g. a 'long' in C is 32 bits on some platforms and 64 bits on others, whereas in Java an int is always 32 bits and a long is always 64 bits.
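(Here's a minimal sketch of the length-prefix convention mentioned above, written with Java's DataOutputStream/DataInputStream; it round-trips through a byte array instead of a real socket purely to show the idea, and the helper names are invented.)

import java.io.*;

// Sketch of the "length prefix" convention: the sender writes the byte count first,
// then the text bytes; the reader uses that count to know when to stop.
public class LengthPrefixedString {
    static void send(DataOutputStream out, String text) throws IOException {
        byte[] data = text.getBytes("UTF-8");
        out.writeInt(data.length);   // 4-byte length, high byte first
        out.write(data);             // then the raw text bytes
    }

    static String receive(DataInputStream in) throws IOException {
        int length = in.readInt();   // read the 4-byte length
        byte[] data = new byte[length];
        in.readFully(data);          // read exactly that many bytes
        return new String(data, "UTF-8");
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        send(new DataOutputStream(buffer), "hello");
        System.out.println(receive(new DataInputStream(
                new ByteArrayInputStream(buffer.toByteArray()))));   // prints hello
    }
}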

I would warn you that if you're not yet familiar with 90% of this stuff then trying to do low level networking between 2 different languages is going to be very tricky. You might want to keep practising with just 1 language first.


So 8 bits in a row makes 1 byte, and then, as you say, the number of bytes in a row changes the way it's interpreted?

So I had a look at binary counting: 00100 is four in binary, and five would be 00101, but both of them are five bits? And that's without both the client and server knowing what the values represent - without the metadata saying "if you get these eight bits then it's a byte". Can you tell me this: if I send 8 bits from the client to the server, the server knows it's a byte, but can the server tell, through some calculation in the code, that the byte represents the number 5 because of the switch positions (the bits) being 0 or 1?

I'm starting to see the use of bits, if I'm right about the above, but how far can bits go in representing things? Do bits alone mean nothing unless they're grouped up into bytes, mainly because numbers and characters and such can't be represented in bits alone? I'm still missing something here.

Should I be looking at how bits/bytes and such are used when it comes to Unicode ?

https://en.wikipedia.org/wiki/Unicode

Or rather ASCII? Not sure, but I found a page regarding binary code. I'm guessing that should provide some leads.

So I just found that 01100001 can be 97, but it can also be the lowercase 'a' character, depending on the encoding. Is this the reason data types are specified?

8 bits, in order, make a byte. There are no 'rows'. It's 8 binary digits.

00100 is indeed equal to 4 in decimal, 00101 is equal to 5 in decimal, and as typed they are both 5 bits.

You could also just write 100 and 101 for 4 and 5 - the leading 2 binary digits are meaningless, just like if I said 00123 was one hundred and twenty-three.

When you read/write/send/receive a single byte, the order of the bits is fixed, and the number 5 would normally be represented as 00000101. (It might actually get sent the other way around across the network, but you don't need to worry about that.)

In computing, bits represent EVERYTHING. Literally everything in memory, in code, in data, is represented by bits, aggregated into bytes, which in turn represent everything else. I don't know why you say "numbers can't be represented in bits", because we've just shown above how the numbers 4 and 5 are represented as bits. As for other types, the letter 'A' is typically assigned the number 65, which is represented as 01000001. Images are represented as pixels, each pixel is 3 numbers, and each number is stored as above. And so on. Everything is bits.
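(A quick sketch of that point in Java: the same small value can be printed as a number, as binary digits, or as a character, depending only on how the program chooses to interpret it.)

// Sketch: the same value viewed as a number, as bits, and as a character.
public class BitsAreEverything {
    public static void main(String[] args) {
        char letter = 'A';
        int  code   = letter;                              // chars are just numbers: 65
        System.out.println(code);                          // prints 65
        System.out.println(Integer.toBinaryString(code));  // prints 1000001 (i.e. 01000001)
        System.out.println((char) 97);                     // prints 'a': 97 interpreted as a character
    }
}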

Unicode is a completely unrelated issue - if you want to know more about that, best to ask separately. ASCII is a much simpler system, designed for American English.

Data types tell a program how to make sense of the bytes it is given. That's how mere data becomes information.

I said that because I thought bits were too small to represent numbers. What I'm starting to understand is that the values bits represent depend on which table is used, so for example without a table like the ASCII table we wouldn't be able to have characters and such; we'd still only have ones and zeros.

I was trying to see the significance of having multiple data types. I couldn't really find a question about it that mattered, other than one thing that comes up: if 01100001 can represent both 97 and 'a', then how does the computer tell which one is, say, being received from a buffer? I thought maybe if the data type was string instead of int then it would know it's an 'a', but strings can have numbers in them too.

So the guy in this post is saying that whenever the computer receives input, the data type for each character/letter (or whatever the word for it is) is remembered. That kind of makes sense; I'm just wondering how a computer remembers this stuff.

But now, what if I've got a string "Hello 1234" and I send that to a client? I'm not sure what the binary values for 'H' and '1' are, but pretending they're the same, how would the client know that one's a character and the other's a number, and not read "1ello 1234" or "Hello H234"?

 

 

One bit is large enough to represent two numbers: zero and one. Two bits is enough to represent four numbers: zero, one, two, three. Each bit you add doubles the range of numbers you can represent. So you can represent any whole number with bits - you just need to have enough bits. Just like you can represent any whole number with decimal digits - you just need enough digits. Bytes are a way that computers group bits together to make them convenient to manipulate, but they are not an essential part of the process of representing information with bits. In fact you'll see this when you look at broadband or networking speeds, usually measured in megabits or gigabits per second - they treat data as a long stream of bits rather than of bytes.
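(A trivial sketch of the doubling, if it helps to see the numbers: n bits can represent 2^n distinct values.)

// Sketch: how many distinct values n bits can represent (2 to the power n).
public class BitRanges {
    public static void main(String[] args) {
        for (int bits = 1; bits <= 8; bits++) {
            System.out.println(bits + " bit(s): " + (1L << bits) + " values");
        }
        // 1 bit: 2 values ... 8 bits: 256 values (0-255, the range of a u8)
    }
}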

The computer can't tell what a byte means when it reads it from a disk or from a network. It has to be told how to interpret it, and that's what the data type means. I could write a program that reads the first byte from a file and treats it as a letter, or as a number, or as a boolean, or as a grey scale value, or anything really. I write a program and choose the data type, and that determines how it handles the bytes - but the bytes themselves have no idea what they are 'supposed' to be.

As such, it's not true that "when ever the computer receives input the data type for each character/letter or what ever the word for it is - is remembered" - the program doing the receiving must use one of 2 strategies:

  1. Assume the data type - as in my example above, where I could choose any of many different assumptions. If we read bytes from the keyboard, we assume they are characters (letters/numbers). If we read bytes from a file called "picture.bmp", we assume they will make up part of a picture. We establish conventions so that we can write programs that make reasonable assumptions about the data they receive. But notice that this is only a convention - you can open "picture.bmp" up in Notepad if you like, and it'll show you those same bytes as if they were text.
  2. Read some extra data which tells it the data type - this is how a lot of save/load systems work, by writing out a number that represents the type. Perhaps 1 is an int, 2 is a bool, 3 is a string, etc. (You may notice that this just moves the problem - how does the program 'know' that ints are '1', bools are '2' - here, we're back to option 1, where we establish a convention for reporting types to other programs.) A sketch of this approach follows below.
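(Here's one possible sketch of that type-tag idea in Java. The tag values 1/2/3 are an invented convention, not anything built into Java; both sides simply have to agree on them.)

import java.io.*;

// Sketch of option 2: a one-byte "type tag" written in front of each value.
public class TypeTagSketch {
    static final int TAG_INT = 1, TAG_BOOL = 2, TAG_STRING = 3;

    static void writeTagged(DataOutputStream out, Object value) throws IOException {
        if (value instanceof Integer) {
            out.writeByte(TAG_INT);
            out.writeInt((Integer) value);
        } else if (value instanceof Boolean) {
            out.writeByte(TAG_BOOL);
            out.writeBoolean((Boolean) value);
        } else if (value instanceof String) {
            out.writeByte(TAG_STRING);
            out.writeUTF((String) value);   // writeUTF adds its own 2-byte length prefix
        }
    }

    static Object readTagged(DataInputStream in) throws IOException {
        switch (in.readUnsignedByte()) {
            case TAG_INT:    return in.readInt();
            case TAG_BOOL:   return in.readBoolean();
            case TAG_STRING: return in.readUTF();
            default:         throw new IOException("Unknown type tag");
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        writeTagged(out, 42);
        writeTagged(out, "hello");

        DataInputStream in = new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()));
        System.out.println(readTagged(in));   // 42
        System.out.println(readTagged(in));   // hello
    }
}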

If you sent the string "Hello 1234" then actually that would usually be 10 bytes: 72, 101, 108, 108, 111, 32, 49, 50, 51, 52 in decimal, assuming we sent it as ASCII text. The receiving program would have to know to expect text, and would read in each byte and treat it as an ASCII character accordingly. These are all different characters so they all have different codes.

Now, to rework your example, imagine we wanted to send the word "Hello" followed by the number 50. You could write that as bytes 72, 101, 108, 108, 111, 50. But the receiving program needs some way of knowing that bytes 1 to 5 are characters and byte 6 is a (one byte) number - otherwise it has absolutely no way of knowing the difference. That's where we'd pick some sort of a convention - such as writing the receiving program to explicitly expect 5 characters and 1 number.
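(And a sketch of that fixed-layout convention, again in Java: the receiver is simply hard-coded to expect exactly 5 ASCII characters followed by one single-byte number. The byte values are the ones from the example above.)

import java.io.*;

// Sketch of the fixed-layout convention: 5 characters, then a one-byte number.
public class FixedLayoutSketch {
    public static void main(String[] args) throws IOException {
        byte[] wire = {72, 101, 108, 108, 111, 50};   // "Hello" then the number 50
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire));

        byte[] text = new byte[5];
        in.readFully(text);                   // bytes 1-5: the characters
        int number = in.readUnsignedByte();   // byte 6: the number

        System.out.println(new String(text, "US-ASCII") + " / " + number);   // Hello / 50
    }
}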


Typically the bit-level stuff is handled at the hardware level. At the software level you'll open a socket and assign it to a port number, then specify the number of bytes you want to read from the socket. It behaves differently depending on whether it is blocking or non-blocking, but the bottom line is that you will receive an array or stream of bytes. Then it's a matter of a) verifying whether it's junk or noise and b) parsing the bytes into data you can work with (characters, booleans, long ints, doubles, whatever). At a higher level you can automatically try to parse the bytes you get into data structures (this is called deserializing). Bottom line: the bytes are meaningless without context. A lot of higher-level libraries, like HttpClient from the .NET Framework, abstract all this complexity away, so you only have to dive into it if you are talking to a device that communicates over a proprietary protocol. If you want to get into the guts of how this works I suggest researching Modbus. You can read about it here:

http://www.simplymodbus.ca/FAQ.htm

Hope that helps.
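(To tie that back to Java, here is a rough sketch of the "read N bytes, then give them meaning" idea; the host, port and the assumption that the first four bytes form a big-endian int are all made up for illustration.)

import java.io.DataInputStream;
import java.io.IOException;
import java.net.Socket;

// Sketch: pull raw bytes off a socket, then parse them into a value.
public class ReadAndParseSketch {
    public static void main(String[] args) throws IOException {
        try (Socket socket = new Socket("localhost", 5000);
             DataInputStream in = new DataInputStream(socket.getInputStream())) {

            // Step 1: read raw bytes (this blocks until 4 bytes have arrived).
            byte[] raw = new byte[4];
            in.readFully(raw);

            // Step 2: give those bytes meaning - here, treat them as one big-endian int.
            int value = ((raw[0] & 0xFF) << 24) | ((raw[1] & 0xFF) << 16)
                      | ((raw[2] & 0xFF) << 8)  |  (raw[3] & 0xFF);
            System.out.println("Parsed int: " + value);
        }
    }
}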

Am I correct in saying that 1 MB can hold 1024 ASCII characters? Since these characters include the alphabet, spacing and numbers, I need to create a byte array that will store, say, chat messages. I don't want to make it too small, and I'll adjust it later, but I just want to confirm before I continue. I'll be converting bytes using the UTF-8 table; from what I read there are different levels at which bytes are used - 1 byte covers all the ASCII characters, 2 bytes are for other languages, 3 for Japanese and so on - and yeah, I just want to focus on 1 byte for now.

 

I think Java has a library that lets me convert bytes to the UTF-8 table. 

So originally I wanted to use a BufferedInputStream to read bytes sent from a client to the server (yes, I know I have a post for server stuff, but which thread do I post this on?), and from there I use a DataInputStream with the BIS. From my understanding, BIS reads bytes and DIS converts bytes to a data type.

Wait... I just saw DIS has .readUTF! Does that mean that if someone sends the server a string in Japanese, the server will be able to read the bytes and convert it to Japanese?
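(For what it's worth, readUTF expects the "modified UTF-8" layout that DataOutputStream.writeUTF produces - a 2-byte length followed by the encoded text - so it will read Japanese text fine as long as the sender uses that exact format. A small sketch, round-tripping through a byte array instead of a socket:)

import java.io.*;

// Sketch: DataInputStream layered over BufferedInputStream, using writeUTF/readUTF.
public class ReadUtfSketch {
    public static void main(String[] args) throws IOException {
        // Write a string (including non-ASCII characters) the way writeUTF does.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        out.writeUTF("こんにちは");   // Japanese "hello"

        // Read it back through BufferedInputStream -> DataInputStream.
        DataInputStream in = new DataInputStream(
                new BufferedInputStream(new ByteArrayInputStream(buffer.toByteArray())));
        System.out.println(in.readUTF());   // prints こんにちは
    }
}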

5 minutes ago, Xer0botXer0 said:

Am I correct in saying that 1 MB can hold 1024 ASCII characters? Since these characters include the alphabet,

In ASCII (and 8-bit extensions of it such as ISO 8859), each character is represented by 1 byte.

1 MB = 1,024 kB = 1,048,576 bytes.

Thus, 1 MB can hold 1,048,576 ASCII characters.

 

In UTF-8, each character uses between 1 and 4 bytes.
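(A quick sketch of what that means in practice - the byte counts come straight from how UTF-8 encodes different characters:)

import java.nio.charset.StandardCharsets;

// Sketch: how many bytes UTF-8 uses for different characters.
public class Utf8Sizes {
    public static void main(String[] args) {
        System.out.println("A".getBytes(StandardCharsets.UTF_8).length);    // 1 byte (ASCII)
        System.out.println("é".getBytes(StandardCharsets.UTF_8).length);    // 2 bytes
        System.out.println("日".getBytes(StandardCharsets.UTF_8).length);   // 3 bytes
        System.out.println("😀".getBytes(StandardCharsets.UTF_8).length);   // 4 bytes (emoji)
    }
}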

Hello to all my stalkers.

Oh snap,

I forgot about kilobytes.

Thank goodness. 

This topic is closed to new replies.
