Ending and Starting Bytes for Images

So, I needed to find the starting and ending bytes for images and I would like to save them somewhere. Why not here? Please let me know if I need to or can make changes to the table. If we can get the end of file bytes, it will make extraction and manipulation in documents easier.

The starting bytes are well documented but the end bytes are not.

Here is what I found regarding the common formats.

Image Type | Start Bytes | End Bytes | Start Word | End Word
JPEG | 0xFF 0xD8 | 0xFF 0xD9 | - | -
PNG | 0x89 0x50 0x4E 0x47 0x0D 0x0A 0x1A 0x0A | 0x49 0x45 0x4E 0x44 (followed by the CRC 0xAE 0x42 0x60 0x82) | PNG (bytes 2 to 4; see specs) | IEND
GIF | 0x47 0x49 0x46 0x38 | 0x3B | GIF87a or GIF89a | ; [9-bit ending is 101h]
TIFF-Motorola | 0x4D 0x4D 0x00 0x2A | - | MM | -
TIFF-Intel | 0x49 0x49 0x2A 0x00 | - | II | -
PGM2 | 0x50 0x35 | - | P5 | -
Plain PGM | 0x50 0x32 | - | P2 | -
PBM | 0x50 0x31 | - | P1 | -
BMP | 0x42 0x4D | - | BM | -

*A GIF also marks itself by its version (87a or 89a, forming the word GIF8[version] in its magic number). For 8-bit images the compressed image data appears to end with the 9-bit end-of-information code 101h, and the file itself ends with the trailer byte 0x3B, but this is a bit weird.
*The TIFF formats are Big Endian for Motorola (MM) and Little Endian for Intel (II).
*The best statement I could find for the end of a TIFF is that “Each strip ends with the 24-bit end-of-facsimile block (EOFB)” from a document referring to TIFF 1.0.
*A JPEG may also have the JFIF structure and be discoverable this way, since JFIF is usually the first noticeable part of the image when converted to a string. (JFIF: 0x4a 0x46 0x49 0x46)
*PGM comes in two formats, as does TIFF. While the TIFF differences are listed, PGM differs in that plain PGM (P2) stores its pixel values as ASCII text and holds one image, while PGM2 (P5) stores them as raw binary and can hold more than one. Both use the .pgm extension and both are grayscale; the pure black and white format is PBM.
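
To make the table a bit more concrete, here is a minimal Python sketch of how the start and end bytes could be used to identify an image and to pull JPEGs or PNGs out of a larger blob such as a document. The names `START_SIGNATURES`, `CARVE_MARKERS`, `identify` and `carve` are my own and purely illustrative. It assumes JPEG runs from 0xFF 0xD8 to 0xFF 0xD9, that PBM starts with the bytes for P1 (0x50 0x31), and that the PNG end marker includes the four CRC bytes that always follow IEND (0xAE 0x42 0x60 0x82). Only JPEG and PNG are carved because they are the formats here with a reasonably distinctive end marker.

```python
# Start signatures for the common formats (names are illustrative).
START_SIGNATURES = [
    ("JPEG", b"\xff\xd8"),
    ("PNG", b"\x89PNG\r\n\x1a\n"),
    ("GIF", b"GIF87a"),
    ("GIF", b"GIF89a"),
    ("TIFF-Motorola", b"MM\x00\x2a"),
    ("TIFF-Intel", b"II\x2a\x00"),
    ("PGM2", b"P5"),
    ("Plain PGM", b"P2"),
    ("PBM", b"P1"),
    ("BMP", b"BM"),
]

# (start, end) byte pairs for formats with a usable end marker.
# The PNG end marker is the word IEND plus its fixed 4-byte CRC.
CARVE_MARKERS = {
    "JPEG": (b"\xff\xd8", b"\xff\xd9"),
    "PNG": (b"\x89PNG\r\n\x1a\n", b"IEND\xae\x42\x60\x82"),
}


def identify(data):
    """Return the format name whose start bytes match, or None."""
    for name, magic in START_SIGNATURES:
        if data.startswith(magic):
            return name
    return None


def carve(blob, kind):
    """Return every start-to-end slice of the given kind found in blob."""
    start_magic, end_magic = CARVE_MARKERS[kind]
    found = []
    pos = 0
    while True:
        start = blob.find(start_magic, pos)
        if start == -1:
            break
        end = blob.find(end_magic, start + len(start_magic))
        if end == -1:
            break  # start marker with no matching end; stop searching
        end += len(end_magic)
        found.append(blob[start:end])
        pos = end
    return found
```

This is a naive scan. Short signatures such as BM or P1 will match plenty of non-image data, and a JPEG with an embedded thumbnail contains an earlier 0xFF 0xD9, so the carved slice can come out truncated. Treat the results as candidates to validate with a real image library.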

Resources:
JPG: http://en.wikipedia.org/wiki/JPEG
PNG: http://en.wikipedia.org/wiki/Portable_Network_Graphic
GIF: http://en.wikipedia.org/wiki/Graphics_Interchange_Format | http://www.onicos.com/staff/iz/formats/gif.html
TIFF: http://www.fileformat.info/format/tiff/egff.htm

Morning Joe: Legality of Acquiring Scraped Data

One of my tasks at the entry level, besides basic normalization, network programming, ETL, and IT work, is to acquire data using just about anything. Since I am in the US, this sort of data acquisition can be problematic.

I did some research since recent court rulings seem a bit mixed. Legally, in the US, there are a few factors that seem to be important.

Illegal acts obviously include targeting others in an attack. Ask whether you are doing anything akin to hacking or gaining unauthorized access under the Computer Fraud and Abuse Act. Exploiting vulnerabilities or passing SQL in the URL to open a database is illegal at the felony level with a sentence of up to 15 years, no matter how badly the site was programmed (see the cases where an individual exploited security vulnerabilities in Verizon). Also, add a time out between requests even if you round robin or use proxies. DDoS attacks are attacks, and 1000 requests per second can shut down a lot of servers providing public information. The result here is also up to 15 years in jail.
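
As a rough illustration of the time out idea, here is a minimal Python sketch of a throttle that enforces a minimum gap between requests. The `Throttle` class and its `min_interval` parameter are hypothetical names of mine, not part of any library, and the right interval for a given site is a judgment call that this sketch does not answer.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests (illustrative)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests
        self._clock = clock               # injectable so it can be tested
        self._sleep = sleep
        self._last = None                 # time of the previous request

    def wait(self):
        """Sleep if the previous request was too recent; return seconds slept."""
        waited = 0.0
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last = self._clock()
        return waited
```

In use, you would create something like `throttle = Throttle(2.0)` once and call `throttle.wait()` right before every request, whichever proxy it goes through, so the target server never sees more than one request per interval from you.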

Copyright law is starting to become important as well, though. Pure replication of protected data is illegal. Even 4% replication has been deemed a breach. With the recent gutting of the DMCA, a person is even more vulnerable to civil and criminal penalties if somewhat knowingly or negligently serving this data to others. It is nearly impossible to tell whether mixed data was obtained illegally, though.

The following, from the Wikipedia scraping entry (verified; all of the cases are real), says it all.

U.S. courts have acknowledged that users of “scrapers” or “robots” may be held liable for committing trespass to chattels,[5][6] which involves a computer system itself being considered personal property upon which the user of a scraper is trespassing. The best known of these cases, eBay v. Bidder’s Edge, resulted in an injunction ordering Bidder’s Edge to stop accessing, collecting, and indexing auctions from the eBay web site.

Paywalls and products offer another significant, though easy to skirt, boundary. When going behind paywalls, you breach a contract by clicking an agreement not to do something and then doing it. This is particularly damaging since it undercuts any argument that you acted negligently rather than willfully [an issue for damages and penalties, not guilt] in civil and any criminal trials. Ignorance is no defense.

Outside of the US, things are quite different. EU law and the laws of other countries are far more lax. Corporations with big budgets dominate our legal landscape here. With their money, they control the system in a very real way that they do not elsewhere, or at least not as much, despite being just as powerful in almost every other respect.

The gist of the cases and laws seems to point towards getting public information and information that is available without going behind a paywall. Think like a user of the internet and combine a bunch of sources into a unique product. Don’t just ‘steal’ an entire protected site.

Here are a few examples. Trulia owns its information, but you could use it to find an agent's website or collect certain information. However, accessing protected data is not legal, and just re-purposing a site does not seem to be either. The legal amount of pulled information is determinable. Also, a public MLS listing lookup site with no agreement or terms that offers data to the public is fair game. The MLS number lists, however, are normally not fair game since access is heavily guarded behind a wall of registration requiring some fakery to get past.

If a researcher can get to data, so can you. If a researcher needs permission, so do you. A computer is like having a million corporate researchers at your disposal.

As for company policy, it is usually used internally to shield the company from liability and serves as a warning, but it is not entirely enforceable. The legal parts letting you know about copyrights and such are enforceable, and everyone is usually supposed to know them. Complete ignorance is not a legal protection. The policy does provide a ground set of rules. "Be nice, or get banned" is the message, as far as I know.

My personal strategy is to start with public data and embellish it within legal means.