Evolution and current trends of Web crawlers

Fernando Iván Camargo Sarmiento; Sonia Ordóñez Salinas

doi:10.14483/udistrital.jour.reving.2013.2.a02

Authors

Fernando Iván Camargo Sarmiento, Universidad Distrital
Sonia Ordóñez Salinas, Universidad Distrital Francisco José de Caldas

Keywords:

Natural language processing, crawler, search engine, Web crawler, social network, social network crawler.
