
Serving large files over TCP

Started by January 13, 2015 07:06 PM
27 comments, last by hplus0603 9 years, 10 months ago

I didn't mean to store them all. That's just the worst-case scenario, where the clients are all connected and requesting these files at the same time. After you download them from S3, wouldn't these files reside in memory at some point, even for a short while?

Only if you keep the entire file there. But, as I said, you could probably stream them to/from disk so you don't have to keep the entire file in memory, or, as others said, forward them directly to the user while you download them yourself.

What Bob and everyone is saying is....


 
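std::fstream f;

f.open(...);                     // open for reading, in binary mode

// Find the file size.
f.seekg(0, f.end);
size_t m = f.tellg();
f.seekg(0, f.beg);

char buff[1024];
size_t i = 0, l;
while (i < m)
{
    l = sizeof(buff);
    if (l > m - i)               // the last chunk may be shorter than the buffer
        l = m - i;
    f.read(buff, l);
    send(skt, buff, l, 0);       // skt is the connected socket; real code must check the
                                 // return value, since send() may accept fewer than l bytes
    i += l;
}

f.close();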

aka stream it from file to the network.

Notice how only a tiny part of the file ever resides in memory (only 1024 bytes in that code).

stream it from file to the network.


Yes, this is what we're proposing, although the code you wrote is not very efficient.
Both Windows and Linux have system calls that let you transfer data from one file descriptor to another.
For Linux, look at sendfile(): http://linux.die.net/man/2/sendfile
For Windows, look at TransmitFile(): http://msdn.microsoft.com/en-us/library/ms740565(v=VS.85).aspx
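For the Linux side, a minimal sketch of what a sendfile() loop could look like is below. The helper name send_file_to_socket is made up for illustration, both descriptors are assumed to be already open, and the error handling is deliberately thin. It is one possible shape, not the only way to call the API.

```cpp
#include <sys/sendfile.h>  // sendfile(), Linux-specific
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>
#include <cstdlib>
#include <string>

// Hypothetical helper: hand an entire open file (in_fd) to a socket (out_fd)
// without copying it through a user-space buffer.
// Returns true if every byte of the file was accepted by the kernel.
bool send_file_to_socket(int out_fd, int in_fd)
{
    struct stat st;
    if (fstat(in_fd, &st) != 0)
        return false;

    off_t offset = 0;
    while (offset < st.st_size) {
        // sendfile() may transfer fewer bytes than requested;
        // it advances `offset` by the amount actually sent.
        ssize_t sent = sendfile(out_fd, in_fd, &offset,
                                static_cast<size_t>(st.st_size - offset));
        if (sent < 0) {
            if (errno == EINTR)
                continue;          // interrupted by a signal, try again
            return false;
        }
        if (sent == 0)
            break;                 // nothing more to read (file got shorter)
    }
    return offset == st.st_size;
}
```

Note that this only helps when the file can go out byte for byte. If the server has to wrap the data in its own chunk headers, it needs to interleave ordinary send() calls for those headers.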

For the purposes of serving files out of S3 to only selected users, you can use CloudFront signed URLs: http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/PrivateContent.html
The bandwidth cost out of CloudFront will generally be lower than the bandwidth cost of serving the same content out of S3 buckets directly, so doing this is a cost savings compared to the initial proposal.

I still don't understand why the data needs to live in S3 if it's served by hosted servers, though.
enum Bool { True, False, FileNotFound };

I am serving embedded devices with limited network capabilities. I don't think it can handle HTTP, just raw TCP. That's one of the reasons why I can't use a direct URL. I apologize; I should've informed you guys about this limitation earlier.

Even through the TCP, it cannot handle large packets. I have to split them up, put some metadata to identify the chunks, so they can be pieced back together. I guess what would make sense is to introduce some sort of LRU caching, download the files from S3 and keep them in a cache on the server hard drive, as you guys said. As the files are requested, read them in chunks and send them off.
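The chunk-plus-metadata idea could be sketched roughly as follows. The thread never gives the devices' real format, so the Chunk layout (an index and a total count) and the function names are assumptions made purely for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical chunk layout; the real device format is not known.
struct Chunk {
    std::uint32_t index;    // position of this chunk within the file
    std::uint32_t total;    // number of chunks the file was split into
    std::string   payload;  // at most max_payload bytes of file data
};

// Server side: cut the file contents into pieces of at most max_payload bytes.
std::vector<Chunk> split_into_chunks(const std::string& file, std::size_t max_payload)
{
    assert(max_payload > 0);
    const std::size_t total = (file.size() + max_payload - 1) / max_payload;
    std::vector<Chunk> chunks;
    for (std::size_t i = 0; i < total; ++i) {
        chunks.push_back(Chunk{static_cast<std::uint32_t>(i),
                               static_cast<std::uint32_t>(total),
                               file.substr(i * max_payload, max_payload)});
    }
    return chunks;
}

// Device side: put the chunks back in index order and concatenate them.
std::string reassemble(std::vector<Chunk> chunks)
{
    std::sort(chunks.begin(), chunks.end(),
              [](const Chunk& a, const Chunk& b) { return a.index < b.index; });
    std::string file;
    for (const Chunk& c : chunks)
        file += c.payload;
    return file;
}
```

Over a single TCP connection the bytes already arrive in order, so the index mostly matters if the device asks for chunks one at a time or resumes an interrupted transfer.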


I still don't understand why the data needs to live in S3 if it's served by hosted servers, though.

These files are uploaded through some other servers, which store them on S3. I need to pull them from S3 and serve them to these embedded devices. It's not exactly what I am doing here, but imagine users uploading MP3s from their PC, then pulling out their iPhone, walking to the bus station, and playing the files over their 4G network.



For the purposes of serving files out of S3 to only selected users, you can use CloudFront signed URLs: http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/PrivateContent.html
The bandwidth cost out of CloudFront will generally be lower than the bandwidth cost of serving the same content out of S3 buckets directly, so doing this is a cost savings compared to the initial proposal.

Not exactly what we need right now, but I will keep this in mind. I have a feeling I might need this later.


I am serving embedded devices with limited network capabilities. I don't think it can handle HTTP, just raw TCP.

Um... have you ever tried something like this?


telnet www.gamedev.net 80
GET /uploads/showdown/2015_01/sss_03ba33793c84b6dab1decb9180328664.png HTTP/1.0
(hit return twice)

(this downloads the IOTD)

The overhead for HTTP is 107 bytes for a 292kB file, or less than 0.04%. Do you really think that makes a noticeable difference that an embedded device can't handle, limited as the device may be? (In particular, the iPhone example is pretty funny, since the iPhone has a fully fledged web browser which can do a LOT more than just that.) HTTP is easy enough as a protocol so you can implement the very basics (that is, a simple GET) by hand in like 5 lines of code, if there is no existing HTTP library on that embedded system.

cannot handle large packets. I have to split them up, put some metadata to identify the chunks

Why? You could just read a kilobyte or two from the socket (or whatever amount you want), hand it over to the MP3 player, and read another kilobyte or two. What's the problem that reading from a TCP stream can't solve? You don't need to care what packet size the device supports; TCP takes care of that.
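A receive loop of that kind might look like the sketch below. It assumes a POSIX system; drain_stream and its sink callback are made-up names, and read() is used so the same code works on a socket or a pipe.

```cpp
#include <unistd.h>
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Read from fd in pieces of at most buf_size bytes until the peer closes the
// stream, handing each piece to `sink` (e.g. the MP3 decoder).
// Returns the total number of bytes delivered, or -1 on a read error.
long long drain_stream(int fd, std::size_t buf_size,
                       const std::function<void(const char*, std::size_t)>& sink)
{
    std::vector<char> buf(buf_size);
    long long total = 0;
    for (;;) {
        ssize_t n = read(fd, buf.data(), buf.size());
        if (n > 0) {
            sink(buf.data(), static_cast<std::size_t>(n));
            total += n;
        } else if (n == 0) {
            return total;          // end of stream
        } else if (errno != EINTR) {
            return -1;             // real error; EINTR just retries
        }
    }
}
```

The receiver never holds more than buf_size bytes at once, no matter how large the file is or how the sender sized its writes.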


I don't know the exact details of these devices. I don't do the development on them, but I was informed that's their limitation. HTTP isn't that complicated, but perhaps adding logic to handle it is a constraint on the firmware side. I don't know why there's a limitation on the packet size either.


HTTP is easy enough as a protocol so you can implement the very basics (that is, a simple GET) by hand in like 5 lines of code

Yes, he can, if he is in control of the source for the embedded side. It might be a closed system he has no input over...

Edit: ninja'd :(

I don't think it can handle HTTP, just raw TCP.


But, you say that you will control the chunking of the transfer? Thus, you can change how the device interprets the data?

To send an HTTP request, you need to send this:

GET /some/url HTTP/1.0\r\n\r\n

In the response, you can receive and throw away all bytes until you see "\r\n\r\n", at which point you can start downloading, until the server closes the connection. This is about as hard as using raw TCP connections, except that when doing it this way, you can use existing HTTP servers, CDNs, and the like, which would likely simplify your deployment and management.
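As a rough sketch of that client logic: make_get_request and HeaderSkipper are hypothetical names, and the code ignores the status line entirely, which a real client should at least check for a 200.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Build the minimal HTTP/1.0 request shown above for a given path.
std::string make_get_request(const std::string& path)
{
    return "GET " + path + " HTTP/1.0\r\n\r\n";
}

// Feed received bytes in pieces of any size. Everything up to and including
// the first "\r\n\r\n" is discarded; feed() returns only the body bytes.
class HeaderSkipper {
public:
    std::string feed(const char* data, std::size_t len)
    {
        if (in_body_)
            return std::string(data, len);
        static const char kEnd[] = "\r\n\r\n";
        for (std::size_t i = 0; i < len; ++i) {
            // matched_ counts how much of "\r\n\r\n" has been seen so far,
            // so the marker is found even when split across two pieces.
            if (data[i] == kEnd[matched_])
                ++matched_;
            else
                matched_ = (data[i] == '\r') ? 1 : 0;
            if (matched_ == 4) {
                in_body_ = true;
                return std::string(data + i + 1, len - i - 1);
            }
        }
        return std::string();
    }

    bool in_body() const { return in_body_; }

private:
    std::size_t matched_ = 0;
    bool in_body_ = false;
};
```

The device would send make_get_request(...) once, then pass every buffer it receives through feed() and hand the returned bytes to whatever consumes the file.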

Even through the TCP, it cannot handle large packets. I have to split them up


If the devices have small receive buffers (no large IP packets allowed), that's handled by the TCP/IP stack on the devices and the TCP protocol; you don't need to worry about this at the higher layer.

If the problem is that the devices want to make sub-range requests, then you can still use HTTP, and the device would add an HTTP header for the byte range it wants. Again, doing this in HTTP is about as hard as doing it in raw TCP.
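A byte-range request could be built along these lines. The function name and the host parameter are illustrative assumptions; the sketch uses HTTP/1.1, which is where the Range header is defined and which also requires a Host header.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Build a request for bytes first..last (both inclusive) of `path`.
// "Connection: close" keeps the client simple: the body ends when the
// server closes the connection.
std::string make_range_request(const std::string& host, const std::string& path,
                               std::uint64_t first, std::uint64_t last)
{
    assert(first <= last);
    return "GET " + path + " HTTP/1.1\r\n"
           "Host: " + host + "\r\n"
           "Range: bytes=" + std::to_string(first) + "-" + std::to_string(last) + "\r\n"
           "Connection: close\r\n"
           "\r\n";
}
```

A server that honours the header answers with status 206 and only the requested bytes. One that ignores it answers 200 with the whole file, so the device should look at the status line before assuming it got a slice.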

There may still be some real reason to use raw TCP, but I'm not currently convinced that's the right way to do it. If you still want to go that way, hopefully the information in this thread will help you.

Well, you really need to know the protocol of the device that will contact you. If you can recognize it as HTTP, bear in mind that you can receive partial requests, normal requests, connection close, connection keep-alive, etc., and even the lower-level TCP behaviour of the client may differ greatly (some clients may hang on you without reading for a very long time, making your sends just return WOULD_BLOCK :) ).

But if you use the HTTP protocol, or implement enough parts of it, you may get along with many devices.
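One way a server might cope with partial sends and a stalled client is sketched below, assuming POSIX sockets. send_all and timeout_ms are invented for the example, and MSG_NOSIGNAL is a Linux flag that may be absent elsewhere.

```cpp
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <string>

// Send exactly len bytes on skt, coping with partial sends and with
// EAGAIN/EWOULDBLOCK on a non-blocking socket.
// timeout_ms is how long a client that is not reading is tolerated before
// the transfer is abandoned. Returns true if everything was sent.
bool send_all(int skt, const char* data, std::size_t len, int timeout_ms)
{
    std::size_t done = 0;
    while (done < len) {
        ssize_t n = send(skt, data + done, len - done, MSG_NOSIGNAL);
        if (n > 0) {
            done += static_cast<std::size_t>(n);   // possibly a partial send
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;                              // interrupted, retry
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            // The socket buffer is full: wait until the client reads some.
            pollfd p{skt, POLLOUT, 0};
            if (poll(&p, 1, timeout_ms) <= 0)
                return false;                      // timed out or poll failed
            continue;
        }
        return false;                              // connection error
    }
    return true;
}
```

A server handling many devices would normally put all sockets into a single poll() or epoll loop instead of waiting on one socket per call, but the handling of short writes stays the same.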

