Honeytoken – a honeytoken is a tag placed on information (within mailing lists, etc.) that alerts a site admin when the tagged information has been taken.
Honeypot – a honeypot is a trap set to deflect, catch, or detect unauthorized use of an information system.
Scraping – scraping occurs when a user or bot copies large amounts of information from a website. Scraping can be benevolent or malicious.
Malicious Scraping – malicious scraping occurs when users and bots copy site information and use it in ways the site owner never intended. For example, a bot crawls your online directory and copies all the listings. Overnight, a copycat site is created containing your listings. You now have a competitor who spent nothing to get into the market and provides an identical service. Worse, the competitor can confuse your customers, provide a lesser service, and ultimately erode trust in your site, the original.
Proxy Server – a proxy server is a server that acts as an intermediary between a sender and a receiver, e.g. a client web browser and a web server. Proxy servers can keep users anonymous, control their users’ access, and improve the user experience. Professional scrapers use proxy servers to cover their tracks.
CAPTCHA – “Completely Automated Public Turing test to tell Computers and Humans Apart”. As the acronym states, CAPTCHAs are intended to tell bots apart from humans and so keep bots from scouring a site; however, OCR software and CAPTCHA farms help bots pass CAPTCHAs.
Rate Limit – a method to control the rate of site traffic, for example by capping the number of requests a single client may make within a given time period.
Source Code – a list of text commands to be compiled or assembled into an executable program.
Blacklist – a list banning certain IP addresses from accessing a site.
Whitelist – a list that permits certain IP addresses access to the site.
Bots – bots are software applications that run simple and repetitive automated tasks on the Internet.
OCR – Optical Character Recognition. Software that converts an image of text into machine-readable text. Applied to a CAPTCHA image, it allows bots to solve CAPTCHAs.
Spoofed User Agents – a bot may send a false user agent string to masquerade as a legitimate bot, such as a search engine crawler, when in reality it is not one.
Robots.txt protocol – this protocol tells legitimate, cooperative bots which pages of an otherwise publicly accessible site they should not access. Compliance is voluntary, so it does not stop uncooperative bots.
SEO – Search Engine Optimization is the process of improving a website’s visibility on search engines through unpaid search results.
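
To make the Rate Limit entry above more concrete, here is a minimal Python sketch of one common approach, a sliding-window limiter that caps requests per client. The class name, parameters, and the choice of a sliding window are assumptions made for illustration; real sites often rely on a web server or CDN feature instead.

```python
import time
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    """Allow at most max_requests per client in any window_seconds span.

    Hypothetical helper for illustration; not taken from any library.
    """

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # The clock is injectable so the limiter can be exercised without sleeping.
        self.clock = clock
        # client id -> timestamps of that client's accepted requests
        self._hits = defaultdict(deque)

    def allow(self, client_id):
        """Return True if the request is accepted, False if it is over the limit."""
        now = self.clock()
        hits = self._hits[client_id]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

A site might call `allow(client_ip)` on every request and answer with an error status when it returns False. Keying on the IP address alone is a simplification, since scrapers that rotate proxy servers spread their requests across many addresses.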
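
The Blacklist and Whitelist entries above can be illustrated with a short Python sketch built on the standard `ipaddress` module. The function name `is_allowed` and the policy it encodes (a whitelist match overrides a blacklist match, and unlisted addresses are allowed) are assumptions for this example, not a rule stated in the glossary.

```python
import ipaddress


def _parse(networks):
    """Turn strings such as '203.0.113.0/24' or '198.51.100.7' into network objects."""
    return [ipaddress.ip_network(n) for n in networks]


def is_allowed(client_ip, blacklist=(), whitelist=()):
    """Decide whether client_ip may access the site.

    Assumed policy for this sketch: whitelist wins over blacklist,
    and any address on neither list is allowed.
    """
    ip = ipaddress.ip_address(client_ip)
    if any(ip in net for net in _parse(whitelist)):
        return True
    if any(ip in net for net in _parse(blacklist)):
        return False
    return True
```

For example, blacklisting the range `203.0.113.0/24` blocks every address in it, while adding `203.0.113.9` to the whitelist lets that single address back in.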
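
As an illustration of the Robots.txt protocol entry above, the sketch below shows how a cooperative bot written in Python might consult a site's rules before fetching a page, using the standard library's `urllib.robotparser`. The rules, the bot name `ExampleBot`, the `example.com` URLs, and the helper `may_fetch` are all made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: every bot is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these rules, `may_fetch(ROBOTS_TXT, "ExampleBot", "https://example.com/private/data")` is False, so a cooperative bot would skip that page. Nothing forces a bot to run such a check, which is why robots.txt offers no protection against malicious scraping.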