
How a Java Crawler Can Use HTTP Proxy IPs to Work More Efficiently


How can a Java crawler use HTTP proxy IPs to work more efficiently? Crawling technology has advanced rapidly, and anti-crawling technology has kept pace: sites are getting harder to scrape, most of them run their own anti-crawling strategies, and some of those strategies are strict enough to leave you with no obvious way in. The quality of your proxy IPs can therefore matter a great deal. This article briefly explains how a Java crawler can work through HTTP proxy IPs.




1. The User-Agent request header is essential, and it must be randomized. This is a big trap: I didn't randomize it at first, got blocked after a few days of crawling, and assumed the proxies were the problem. I kept telling the provider's support that their proxies had been banned, only to discover that it was the User-Agent in my request headers that had been blocked, after which I sheepishly apologized to the support rep. The User-Agent identifies the browser, so the more of them you rotate through at random, the better; it matters just as much as the proxy IP. Here is a partial list (I can't fit all of them here); a short sketch of picking one at random follows the array.




String[] ua = {"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.87 Safari/537.36 OPR/37.0.2178.32",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2486.0 Safari/537.36 Edge/13.10586",
        "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko",
        "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)",
        "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0)",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 BIDUBrowser/8.3 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36 Core/1.47.277.400 QQBrowser/9.4.7658.400",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 UBrowser/5.6.12150.8 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36 TheWorld 7",
        "Mozilla/5.0 (Windows NT 6.1; W…) Gecko/20100101 Firefox/60.0"};




2. The request source: the Referer header. Honestly, I overlooked this one for a long time and only noticed it while scraping a particular site. (Why so much anti-crawling effort? It's exhausting; this is the internet, data should be shared!) The referer field in the request headers records where the request came from. Why forge it? If you don't fake the source and hit someone's interface directly, why should they serve you? That interface may only be meant to be called from their own pages. A browser always sends a source with its requests, so leaving it out gives you away. The exact value differs from site to site: press F12 and check what the browser sends when it makes the request.
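For example, here is a sketch of forging the source with jsoup (both URLs below are placeholders; use whatever you actually see the browser send in the F12 Network tab):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class RefererDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder URLs: the interface being scraped and the page that normally calls it
        Document doc = Jsoup.connect("https://example.com/api/list")
                .referrer("https://example.com/list.html")  // pretend the request came from that page
                .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0")
                .ignoreContentType(true)
                .get();
        System.out.println(doc.text());
    }
}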


3. Proxy IPs. High-quality proxy IPs are essential. Free ones are not much use: their availability rate is too low and they are slow. If you are going to crawl, you want speed and efficiency, so the demands on the proxies are high, and you need a fairly large pool of working ones; otherwise the target site changes something before you finish and your crawler falls over. Here I recommend et代理's short-lived premium proxy IPs, with a daily volume of roughly 200,000 IPs and an availability rate of about 98%.


I'll just paste the source code below. It should run as-is; just remember to swap in your own URL for fetching proxy IPs.




import java.io.IOException;
import java.util.*;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import net.sf.json.JSONObject;

public class Test {
    // URL for fetching proxy IPs -- remember to replace it; I use et代理's API, but any other provider works too
    private final static String GET_IP_URL = "the et代理 API endpoint URL";

    public static void main(String[] args) throws InterruptedException {
        List<String> addrs = new LinkedList<String>();
        Map<String, Integer> addr_map = new HashMap<String, Integer>();
        Map<String, String> ipmap = new HashMap<String, String>();
        ExecutorService exe = Executors.newFixedThreadPool(10);
        for (int i = 0; i < 1; i++) {
            Document doc = null;
            try {
                // Call the proxy API; the response body is a JSON document listing the proxies
                doc = Jsoup.connect(GET_IP_URL).get();
            } catch (IOException e) {
                continue;
            }
            System.out.println(doc.text());
            JSONObject jsonObject = JSONObject.fromObject(doc.text());
            // "msg" holds the proxy list returned by the API
            List<Map<String, Object>> list = (List<Map<String, Object>>) jsonObject.get("msg");
            int count = list.size();

            for (Map<String, Object> map : list) {
                String ip = (String) map.get("ip");
                String port = (String) map.get("port");
                ipmap.put(ip, "1");
                // Check each proxy in its own task on the thread pool
                checkIp a = new checkIp(ip, Integer.parseInt(port), count);
                exe.execute(a);
            }
            exe.shutdown();
            Thread.sleep(1000);
        }
    }
}

class checkIp implements Runnable {
    private static Logger logger = LoggerFactory.getLogger(checkIp.class);
    // Counters shared across all worker threads
    private static int suc = 0;
    private static int total = 0;
    private static int fail = 0;

    private String ip;
    private int port;
    private int count;

    public checkIp(String ip, int port, int count) {
        super();
        this.ip = ip;
        this.port = port;
        this.count = count;
    }

    @Override
    public void run() {
        Random r = new Random();
        String[] ua = {"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.87 Safari/537.36 OPR/37.0.2178.32",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2486.0 Safari/537.36 Edge/13.10586",
                "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko",
                "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)",
                "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)",
                "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0)",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 BIDUBrowser/8.3 Safari/537.36",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36 Core/1.47.277.400 QQBrowser/9.4.7658.400",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 UBrowser/5.6.12150.8 Safari/537.36",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36 TheWorld 7",
                "Mozilla/5.0 (Windows NT 6.1; W…) Gecko/20100101 Firefox/60.0"};
        // Pick a random User-Agent for this request
        int i = r.nextInt(ua.length);
        logger.info("checking ------ {}:{}", ip, port);
        Map<String, String> map = new HashMap<String, String>();
        map.put("waybillNo", "DD1838768852");
        try {
            total++;
            long a = System.currentTimeMillis();
            // The target site to crawl -- remember to replace the URL!
            Document doc = Jsoup.connect("URL of the target page")
                    .timeout(5000)
                    .proxy(ip, port)
                    .data(map)
                    .ignoreContentType(true)
                    .userAgent(ua[i])
                    .header("referer", "the referer to send")  // remember to replace this source as well
                    .post();
            System.out.println(ip + ":" + port + " request time: " + (System.currentTimeMillis() - a) + "   result: " + doc.text());
            suc++;
        } catch (IOException e) {
            e.printStackTrace();
            fail++;
        } finally {
            if (total == count) {
                System.out.println("Total requests: " + total);
                System.out.println("Successes: " + suc);
                System.out.println("Failures: " + fail);
            }
        }
    }
}

